The value of an in-domain lexicon in genomics QA

    Research output: Contribution to journal, Article, peer-review


    This paper demonstrates that a large-scale lexicon tailored for the biology domain is effective in improving question analysis for genomics Question Answering (QA). We use the TREC Genomics Track data to evaluate the performance of different question analysis methods. It is hard to process textual information in biology, especially in molecular biology, due to a huge number of technical terms which rarely appear in general English documents and dictionaries. To support biological Text Mining, we have developed a domain-specific resource, the BioLexicon. Started in 2006 from scratch, this lexicon currently includes more than four million biomedical terms consisting of newly curated terms and terms collected from existing biomedical databases. While conventional genomics QA systems provide query expansion based on thesauri and dictionaries, it is not clear to what extent a biology-oriented lexical resource is effective for question pre-processing for genomics QA. Experiments on the genomics QA data set show that question analysis using the BioLexicon performs slightly better than that using n-grams and the UMLS Specialist Lexicon. © 2010 Imperial College Press.
    Original language: English
    Pages (from-to): 147-161
    Number of pages: 14
    Journal: Journal of Bioinformatics and Computational Biology
    Issue number: 1
    Publication status: Published - Feb 2010


    • BioLexicon
    • Genomics IR
    • Genomics QA

