|
BMC Bioinformatics 2006
Association algorithm to mine the rules that govern enzyme definition and to classify protein sequences

Abstract: Five datasets were collected from Swiss-Prot to establish the annotation rules and were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset, which lacks an enzyme class description, was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions.

The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may also be employed by protein annotators in manual annotation or implemented in an automatic annotation flowchart.

The number of sequences generated by genome projects is growing exponentially, but most of them have not been characterized experimentally. Manual annotation methods have been proposed by experts and are popular at the genome centers, but their annotation capacity is exceeded by the fast-growing genome data. An automatic annotation scheme is urgently needed to speed up reliable functional annotation of newly produced sequences. Automatic annotation provides an efficient procedure for analyzing gene sequences. Most automatic solutions used to characterize gene sequences are based on a high-level sequence similarity search against known protein databases, for example with the BLAST or FASTA program. The correlation between sequence composition and functional characterization provides the foundation for transferring functional knowledge from a biochemically characterized protein to a homologous but uncharacterized one.
However, sequence composition bias and database updating commonly influence the results of similarity searches, and they do not yield the exact share between biological function
|