PLOS ONE  2008 

Cataloging Coding Sequence Variations in Human Genome Databases

DOI: 10.1371/journal.pone.0003575



Background: With the recent growth of information on sequence variations in the human genome, predictions regarding the functional effects of coding sequence variations and their relevance to disease phenotypes are becoming increasingly important. The aims of this study were to catalog protein-coding sequence variations (CVs) occurring in genetic variation databases and to use bioinformatic programs to analyze those CVs. In addition, we aimed to provide insight into the functionality of the reference databases.

Methodology and Findings: To catalog CVs on a genome-wide scale with regard to protein function and disease, we investigated three representative databases: the Human Gene Mutation Database (HGMD), the Single Nucleotide Polymorphisms database (dbSNP), and the Haplotype Map (HapMap). Using these three databases, we analyzed CVs at the protein function level with bioinformatic programs. We proposed a combinatorial approach using the Support Vector Machine (SVM) to increase the performance of the prediction programs. By cataloging the coding sequence variations in these databases, we found that 4.36% of CVs from HGMD are concurrently registered in dbSNP (8.11% of CVs from dbSNP are concurrent in HGMD). The pattern of substitutions and the functional consequences predicted by three bioinformatic programs differed significantly among concurrent CVs, CVs occurring solely in HGMD, and CVs occurring solely in dbSNP. The experimental results showed that the proposed SVM combination noticeably outperformed the individual prediction programs.

Conclusions: This is the first study to compare human sequence variations in HGMD, dbSNP and HapMap at the genome-wide level. We found that a significant proportion of CVs in HGMD and dbSNP overlap, and we emphasize the need for caution when interpreting the phenotypic relevance of these concurrent CVs. Combining bioinformatic programs can help predict the functional consequences of CVs, as the combination improved the performance of functional predictions.
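
The two overlap figures reported in the findings (4.36% and 8.11%) describe the same set of concurrent CVs measured against two different denominators, which is why they differ. The following is a minimal Python sketch of that calculation, not the authors' pipeline. The function name `overlap_percentages`, the tuple layout used to identify a CV, and the toy records are all hypothetical and chosen only for illustration.

```python
def overlap_percentages(hgmd_cvs, dbsnp_cvs):
    """Return (% of HGMD CVs also in dbSNP, % of dbSNP CVs also in HGMD).

    Each CV is assumed (hypothetically) to be a hashable key such as
    (gene, codon position, reference residue, variant residue).
    """
    hgmd, dbsnp = set(hgmd_cvs), set(dbsnp_cvs)
    shared = hgmd & dbsnp  # CVs registered in both databases
    # Same numerator, different denominators: hence two different percentages.
    pct_of_hgmd = 100.0 * len(shared) / len(hgmd) if hgmd else 0.0
    pct_of_dbsnp = 100.0 * len(shared) / len(dbsnp) if dbsnp else 0.0
    return pct_of_hgmd, pct_of_dbsnp


# Toy, made-up records (not real database entries).
hgmd_toy = {
    ("GENE_A", 10, "R", "W"),
    ("GENE_A", 55, "G", "D"),
    ("GENE_B", 7, "L", "P"),
    ("GENE_C", 120, "E", "K"),
}
dbsnp_toy = {
    ("GENE_A", 10, "R", "W"),
    ("GENE_D", 33, "A", "T"),
}

pct_hgmd, pct_dbsnp = overlap_percentages(hgmd_toy, dbsnp_toy)
print(f"{pct_hgmd:.2f}% of HGMD CVs are also in dbSNP")
print(f"{pct_dbsnp:.2f}% of dbSNP CVs are also in HGMD")
```

In this toy case one shared CV gives 25% of the four HGMD records but 50% of the two dbSNP records, mirroring how a single concurrent set yields two different percentages in the study.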



