Predicting the sub-cellular location of proteins from text using support vector machines.

B. J. Stapley, L. A. Kelley, M. J. Sternberg

    Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review


    We present an automatic method to classify the sub-cellular location of proteins based on the text of relevant MEDLINE abstracts. For each protein, a vector of terms is generated from the MEDLINE abstracts in which the name of the protein or gene, or one of its synonyms, occurs. A Support Vector Machine (SVM) is used to automatically partition the term space and thus to discriminate the textual features that define sub-cellular location. The method is benchmarked on a set of S. cerevisiae proteins of known sub-cellular location. Neither prior knowledge of the problem domain nor any natural language processing is used at any stage. The method outperforms support vector machines trained on amino acid composition and performs comparably to rule-based text classifiers. Combining text with protein amino acid composition improves recall for some sub-cellular locations. We discuss the generality of the method and its potential application to a variety of biological classification problems.
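The pipeline described in the abstract (one term vector per protein, then an SVM over the term space) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the toy proteins and abstracts are invented, the whitespace tokeniser and L2-normalised counts stand in for whatever term weighting the chapter uses, the problem is reduced to a single binary decision (nucleus versus not), and the linear SVM is trained with a hand-rolled Pegasos-style subgradient method so that the example needs only the standard library.

```python
import math
import random
from collections import Counter

def term_vector(abstracts):
    """Pool all abstracts mentioning one protein into L2-normalised term counts."""
    counts = Counter(w for text in abstracts for w in text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}

def dot(w, x):
    """Sparse dot product between a weight dict and a term vector."""
    return sum(w.get(term, 0.0) * v for term, v in x.items())

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style stochastic subgradient descent on the regularised hinge loss.

    X is a list of sparse term vectors, y a list of labels in {+1, -1}.
    No bias term is fitted, which keeps the sketch short.
    """
    rng = random.Random(seed)
    w, t = {}, 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            margin = y[i] * dot(w, X[i])       # margin under the current weights
            for term in w:                     # shrinkage from the L2 penalty
                w[term] *= 1.0 - eta * lam
            if margin < 1.0:                   # hinge loss active: move towards example
                for term, v in X[i].items():
                    w[term] = w.get(term, 0.0) + eta * y[i] * v
    return w

def predict(w, abstracts):
    """Return +1 (nucleus) or -1 (not nucleus) for a protein's abstracts."""
    return 1 if dot(w, term_vector(abstracts)) > 0 else -1

# Hypothetical toy corpus: protein name -> (abstracts mentioning it, label).
toy = {
    "protA": (["binds dna in the nucleus", "nuclear transcription factor"], 1),
    "protB": (["chromatin remodelling in the nucleus"], 1),
    "protC": (["mitochondrial membrane transport"], -1),
    "protD": (["respiratory chain in the mitochondrial matrix"], -1),
}
X = [term_vector(texts) for texts, _ in toy.values()]
y = [label for _, label in toy.values()]
w = train_linear_svm(X, y)

print(predict(w, ["nucleus localisation signal"]))
print(predict(w, ["mitochondrial import"]))
```

A multi-location classifier along the lines of the abstract could be built from this by training one such binary model per location (one-vs-rest). In practice an established SVM library would replace the hand-written trainer.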
    Original language: English
    Title of host publication: Pacific Symposium on Biocomputing 2002: Kauai, Hawaii, USA, 3-7 January 2002
    Publisher: World Scientific Publishing Co
    Number of pages: 11
    ISBN (Print): 981024777X
    Publication status: Published - 2002

