
PLOS ONE  2013 

Finding Protein-Coding Genes through Human Polymorphisms

DOI: 10.1371/journal.pone.0054210



Human gene catalogs are fundamental to the study of human biology and medicine, but they are all based on open reading frames (ORFs) in a reference genome sequence (with allowance for introns). Individual genomes, however, are polymorphic: their sequences are not identical. There has been much research on how polymorphism affects previously identified genes, but none on how it affects gene identification itself. We computationally predict protein-coding genes in a straightforward manner, by finding long ORFs in mRNA sequences aligned to the reference genome, and we systematically test the effect of known polymorphisms on this procedure. Polymorphisms can not only disrupt ORFs but also create long ORFs that do not exist in the reference sequence. We found 5,737 putative protein-coding genes that do not exist in the reference and whose protein-coding status is supported by homology to known proteins. On average, 10% of these genes are located in genomic regions devoid of annotated genes in 12 other catalogs. Our statistical analysis showed that these ORFs are unlikely to occur by chance.
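The core idea, that a single-base polymorphism can lengthen or create an ORF, can be illustrated with a toy sketch. This is not the authors' pipeline: the function names `longest_orf` and `apply_snp`, the example sequence, and the simplifications (forward strand only, ATG-to-stop ORFs, no length threshold, no alignment or homology step) are all assumptions made for illustration.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> Tuple[int, int]:
    """Return (start, end) of the longest ATG-to-stop ORF on the forward strand.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    Returns (0, 0) when no complete ORF exists. (Hypothetical helper.)
    """
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3) - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None  # close the ORF and look for the next ATG
    return best


def apply_snp(seq: str, pos: int, alt: str) -> str:
    """Return seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


# Made-up reference transcript: an early TAG stop truncates the ORF.
reference = "ATGAAATAGCCCGGGTTTTAA"
# Hypothetical SNP (A to G at position 7) turns TAG into TGG,
# so the reading frame continues to the final TAA.
variant = apply_snp(reference, 7, "G")

ref_start, ref_end = longest_orf(reference)
var_start, var_end = longest_orf(variant)
print("reference ORF length:", ref_end - ref_start)
print("variant ORF length:  ", var_end - var_start)
```

In this made-up example the variant allele removes a premature stop codon, so the longest ORF grows relative to the reference. A realistic analysis would also need both strands, spliced alignments, a minimum ORF length, and a check such as homology to known proteins, as the abstract describes.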



