Improving the performance of dictionary-based approaches in protein name recognition

Yoshimasa Tsuruoka, Jun'Ichi Tsujii

    Research output: Contribution to journal › Article › peer-review


    Dictionary-based protein name recognition is often the first step in extracting information from biomedical documents because it can provide ID information on recognized terms. However, dictionary-based approaches present two fundamental difficulties: (1) false recognition, mainly caused by short names; and (2) low recall due to spelling variations. In this paper, we tackle the former problem by using machine learning to filter out false positives, and we present two alternative methods for alleviating the latter problem of spelling variations. The first uses approximate string searching, and the second expands the dictionary with a probabilistic variant generator, which we propose in this paper. Experimental results using the GENIA corpus revealed that filtering with a naive Bayes classifier greatly improved precision with only a slight loss of recall, resulting in a 10.8% improvement in F-measure, and that dictionary expansion with the variant generator gave a further 1.6% improvement and achieved an F-measure of 66.6%. © 2004 Elsevier Inc. All rights reserved.
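    The approximate string searching mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's cost function, threshold, or search procedure, so everything here is an assumption: plain unit-cost Levenshtein distance, a linear scan over the dictionary, and a hypothetical toy dictionary with made-up IDs (`TOY_DICTIONARY`, `lookup`, and `max_distance` are illustrative names, not the authors').

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def lookup(candidate: str, dictionary: dict, max_distance: int = 1) -> list:
    """Return (entry, id, distance) for entries within max_distance.

    Comparison is case-insensitive; results are sorted closest first.
    """
    key = candidate.lower()
    hits = []
    for name, protein_id in dictionary.items():
        d = edit_distance(key, name.lower())
        if d <= max_distance:
            hits.append((name, protein_id, d))
    return sorted(hits, key=lambda h: (h[2], h[0]))


# Hypothetical entries and IDs, for illustration only.
TOY_DICTIONARY = {
    "NF-kappa B": "P001",
    "interleukin-2": "P002",
    "IL-2": "P003",
}

print(lookup("NF kappa B", TOY_DICTIONARY))
print(lookup("IL2", TOY_DICTIONARY))
```

    In this sketch the spelling variants "NF kappa B" and "IL2" each match their dictionary entry at distance 1, and the matched ID comes back with the entry. A loose threshold on very short names would plausibly admit more spurious matches, which is the kind of false recognition the abstract assigns to the filtering step.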
    Original language: English
    Pages (from-to): 461-470
    Number of pages: 9
    Journal: Journal of Biomedical Informatics
    Issue number: 6
    Publication status: Published - Dec 2004


    • Approximate string search
    • Naive Bayes classifier
    • Protein name recognition
    • Spelling variant generator

