|
BMC Bioinformatics 2007
Predicting the phenotypic effects of non-synonymous single nucleotide polymorphisms based on support vector machines

Abstract: We propose a method, named Parepro (Predicting the amino acid replacement probability), to identify nsSNPs having either deleterious or neutral effects on the resulting protein function. Two independent datasets, HumVar and NewHumVar, taken from the PhD-SNP server, were used to train the model and to test the robustness of Parepro. In a 20-fold cross-validation test on the HumVar dataset, Parepro achieved a Matthews correlation coefficient (MCC) of 50% and an overall accuracy (Q2) of 76%, both higher than those obtained by methods such as PolyPhen, SIFT, and HybridMeth. Further analysis on an additional dataset (NewHumVar) using Parepro yielded similar results. The performance of Parepro indicates that it is a powerful tool for predicting the effect of nsSNPs on protein function and would be useful for large-scale analysis of genomic nsSNP data.

Almost 90% of human genetic variation results from single nucleotide polymorphisms (SNPs) [1]. Among SNPs, non-synonymous SNPs (nsSNPs), which result in amino acid changes, are an important source of individual variation and can result in inherited diseases and drug sensitivity [2-4]. Therefore, the identification of nsSNPs that affect protein function and relate to disease will be a challenge in the coming years [3,5-8].

A variety of methods have been developed to identify whether an nsSNP is detrimental to protein function in vitro. Most of these methods use evolutionary data [3,8-17], protein structure information [2,18,19], or both [2,7,20-22]. Ng and Henikoff [8,16,23] developed the software SIFT (Sorting Intolerant from Tolerant) to predict the effect of nsSNPs on protein function; SIFT is based on sequence conservation and scores from position-specific scoring matrices. Some studies [24-26] have used phylogenetics to identify functionally critical residues within a protein.
The MAPP (Multivariate Analysis of Protein Polymorphism) [18] software exploits the physicochemical variation between wild-type
|