Classifying protein fingerprints

Melanie Hilario, Alex Mitchell, Jee Hyub Kim, Paul Bradley, Terri Attwood

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Protein fingerprints are groups of conserved motifs which can be used as diagnostic signatures to identify and characterize collections of protein sequences. These fingerprints are stored in the PRINTS database after time-consuming annotation by domain experts, who must first determine the fingerprint type, i.e., whether a fingerprint depicts a protein family, superfamily or domain. To alleviate the annotation bottleneck, a system called PRECIS has been developed which automatically generates PRINTS records, provisionally stored in a supplement called prePRINTS. One limitation of PRECIS is that its classification heuristics, handcoded by proteomics experts, often misclassify fingerprint type; their error rate has been estimated at 40%. This paper reports on an attempt to build more accurate classifiers based on information drawn from the fingerprints themselves and from the SWISS-PROT database. Extensive experimentation using 10-fold cross-validation led to the selection of a model combining the ReliefF feature selector with an SVM-RBF learner. The final model's error rate was estimated at 14.1% on a blind test set, an accuracy gain of about 26 percentage points over PRECIS' handcrafted rules. © Springer-Verlag Berlin Heidelberg 2004.
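
The feature-selection step named in the abstract can be illustrated with a minimal sketch. It implements only the basic two-class Relief weighting, of which ReliefF is the multi-class, k-nearest-neighbour extension, and it runs on synthetic data rather than on PRINTS or SWISS-PROT features. The function names, parameters and toy data are assumptions made for illustration, not the authors' code, and the SVM-RBF learner is not shown.

```python
import random


def relief_weights(X, y, n_iterations=None, seed=0):
    """Simplified two-class Relief (not full ReliefF).

    A feature gains weight when it differs on the nearest instance of the
    other class (nearest miss) and loses weight when it differs on the
    nearest instance of the same class (nearest hit).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])

    # Scale every feature to [0, 1] so per-feature differences are comparable.
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]
    Z = [[(row[j] - lo[j]) / span[j] for j in range(d)] for row in X]

    def dist(a, b):
        # Manhattan distance over the scaled features.
        return sum(abs(p - q) for p, q in zip(a, b))

    m = n_iterations or n
    weights = [0.0] * d
    for _ in range(m):
        i = rng.randrange(n)
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: dist(Z[i], Z[k]))
        miss = min(misses, key=lambda k: dist(Z[i], Z[k]))
        for j in range(d):
            weights[j] += (abs(Z[i][j] - Z[miss][j])
                           - abs(Z[i][j] - Z[hit][j])) / m
    return weights


def make_toy_data(n=60, seed=1):
    """Hypothetical data: feature 0 tracks the label, feature 1 is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([label + rng.gauss(0, 0.1), rng.random()])
        y.append(label)
    return X, y


X, y = make_toy_data()
weights = relief_weights(X, y)
# Feature indices ordered from most to least relevant.
ranking = sorted(range(len(weights)), key=lambda j: -weights[j])
print(weights, ranking)
```

In a pipeline of the kind the abstract describes, the top-ranked features would then be passed to the classifier, with the number of features kept chosen by cross-validation.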
    Original language: English
    Pages (from-to): 197-208
    Number of pages: 11
    Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 3202
    Publication status: Published - 2004
