BMC Bioinformatics 2005
Predicting functional sites with an automated algorithm suitable for heterogeneous datasets

Abstract: In this report, we present an algorithmic approach that determines thresholds without human subjectivity. The approach relies on significant raw data preprocessing to improve signal detection. Subsequently, Partition Around Medoids Clustering (PAMC) of the similarity scores assesses sequence fragments where functional annotation remains in question. The accuracy of the approach is confirmed through comparisons to our previous (manual) results and structural analyses. Triosephosphate isomerase and arginyl-tRNA synthetase are discussed as exemplar cases. A quantitative functional site prediction assessment algorithm indicates that the phylogenetic motif predictions, which require sequence information only, are nearly as good as those from evolutionary trace methods that do incorporate structure.

The automated threshold detection algorithm has been incorporated into MINER, our web-based phylogenetic motif identification server. MINER is freely available on the web at http://www.pmap.csupomona.edu/MINER/. Pre-calculated functional site predictions of the COG database and an implementation of the threshold detection algorithm, in the R statistical language, can also be accessed at the website.

Due to the exponential growth of genomic and protein sequence data, development of automated strategies for large scale functional site identification is an important post-genomic challenge. Many recent efforts predict functional sites from sequence alone. Strong candidates for functional sites include individual highly conserved positions within a sequence alignment and highly conserved sequence motifs [1-5]. Although attractive due to their relative simplicity, conservation-based approaches frequently result in too many false positives to be satisfactory [3].
Further, sequence regions with significant variability can also be functionally important [6], especially when their composition may define sub-family functional specificity. The Evolutionary Trace (ET) procedure [7],