Abstract
Predicting protein function from amino acid sequence is a central research goal of molecular biology. Such a capability would greatly aid the biological interpretation of genomic data and accelerate its medical exploitation. In the genomes sequenced so far, function can typically be assigned to only 40-60% of the genes [4,8,12,7]. The new science of functional genomics is dedicated to discovering the function of the remaining genes and to detailing gene function further [10,27,17,6]. Here we present a novel data-mining [24,18] approach to predicting protein functional class from sequence. We demonstrate the effectiveness of this approach on the Mycobacterium tuberculosis [8] genome. We identify biologically interpretable rules that can predict protein function even in the absence of identifiable sequence homology. These rules predict function for 65% of the genes with no previously assigned function in Mycobacterium tuberculosis (the bacterium that causes TB), with an estimated accuracy of 60-80% (depending on the level of functional assignment). The rules also give insight into the evolutionary history of the organism.
Original language | English |
---|---|
Title of host publication | Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |
Editors | R. Ramakrishnan, S. Stolfo, R. Bayardo, I. Parsa |
Pages | 384-389 |
Number of pages | 5 |
Publication status | Published - 2000 |
Event | Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000) - Boston, MA. Duration: 1 Jul 2000 → … |
Conference
Conference | Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000) |
---|---|
City | Boston, MA |
Period | 1/07/00 → … |
Keywords
- Biology and genetics
- Concept learning
- Data mining