Background
With the recent growth of information on sequence variation in the human genome, predicting the functional effects of coding sequence variations and their relevance to disease phenotypes is becoming increasingly important. The aims of this study were to catalog protein-coding sequence variations (CVs) occurring in genetic variation databases and to analyze these CVs with bioinformatic programs. In addition, we aimed to provide insight into the functionality of the reference databases.

Methodology and Findings
To catalog CVs on a genome-wide scale with regard to protein function and disease, we investigated three representative databases: the Human Gene Mutation Database (HGMD), the Single Nucleotide Polymorphism database (dbSNP), and the Haplotype Map (HapMap). Using these three databases, we analyzed CVs at the protein function level with bioinformatic programs. We proposed a combinatorial approach using the Support Vector Machine (SVM) to increase the performance of the prediction programs. By cataloging the coding sequence variations in these databases, we found that 4.36% of CVs from HGMD are concurrently registered in dbSNP (conversely, 8.11% of CVs from dbSNP are concurrently registered in HGMD). The pattern of substitutions and the functional consequences predicted by three bioinformatic programs differed significantly among concurrent CVs, CVs occurring solely in HGMD, and CVs occurring solely in dbSNP. The experimental results showed that the proposed SVM combination noticeably outperformed the individual prediction programs.

Conclusions
This is the first study to compare human sequence variations in HGMD, dbSNP and HapMap at the genome-wide level. We found that a significant proportion of CVs in HGMD and dbSNP overlap, and we emphasize the need for caution when interpreting the phenotypic relevance of these concurrent CVs. Combining bioinformatic programs can help predict the functional consequences of CVs, as the combination improved the performance of functional predictions.
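
The two overlap figures reported for HGMD and dbSNP differ because each is the same set of shared CVs divided by a different database total. The following is a minimal Python sketch of that reciprocal calculation, not the study's actual pipeline. The function name `overlap_percentages`, the tuple layout used as a variant key, and the toy gene names are all assumptions made for illustration; the toy sets do not reproduce any data from the databases.

```python
def overlap_percentages(set_a, set_b):
    """Return (% of set_a also found in set_b, % of set_b also found in set_a)."""
    shared = set_a & set_b  # variants registered in both collections
    pct_a = 100.0 * len(shared) / len(set_a) if set_a else 0.0
    pct_b = 100.0 * len(shared) / len(set_b) if set_b else 0.0
    return pct_a, pct_b


# Hypothetical variant keys: (gene, protein position, reference residue, variant residue).
# These are made-up placeholders, not real HGMD or dbSNP entries.
hgmd_toy = {
    ("GENE1", 10, "A", "T"),
    ("GENE1", 25, "R", "H"),
    ("GENE2", 7, "G", "D"),
    ("GENE3", 101, "L", "P"),
}
dbsnp_toy = {
    ("GENE1", 25, "R", "H"),
    ("GENE4", 3, "S", "N"),
}

pct_hgmd, pct_dbsnp = overlap_percentages(hgmd_toy, dbsnp_toy)
print(f"{pct_hgmd:.2f}% of toy HGMD variants are also in toy dbSNP")
print(f"{pct_dbsnp:.2f}% of toy dbSNP variants are also in toy HGMD")
```

In this toy case one shared variant gives two different percentages because the two sets differ in size, which is the same reason the HGMD-based and dbSNP-based figures above are not equal.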