Arabidopsis thaliana became the model organism for plant studies because of its small diploid genome, rapid life cycle and small adult size. Its genome was the first plant genome to be sequenced and became the reference in plant genomics. However, the Arabidopsis genome has an inherently complex organization: it underwent ancient whole-genome duplications followed by gene reduction, diploidization and extensive rearrangements that relocated and fragmented the retained regions. These events, together with probable reductions in chromosome number, dramatically increased the complexity of the genome and limit its role as a reference. Identifying paralogs and single-copy genes within a highly duplicated genome is a prerequisite for understanding its organization and evolution and for exploiting it more effectively in comparative genomics. This classification remains controversial even for the widely studied Arabidopsis genome, partly because no reference bioinformatics pipeline exists that can exhaustively identify paralogs and singleton genes. Here we describe a complete computational strategy for detecting both duplicated and single-copy genes in a genome, and we discuss the methodological issues that can strongly affect the quality and reliability of the results. We used this approach to analyze the organization of the Arabidopsis nuclear protein-coding genes. Besides classifying computationally defined paralogs into networks and single-copy genes into different classes, the analysis revealed further aspects of the genome annotation and of gene relationships in this reference plant species. Because our results may be useful for comparative genomics and functional genome analyses, we set up a dedicated web interface to make them accessible to the scientific community.
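The idea of grouping paralogs into networks and separating out single-copy genes can be illustrated with a minimal sketch. It is not the pipeline described in this work. It assumes that an all-against-all protein similarity search has already produced (query, subject, E-value) tuples, and it treats genes joined by any accepted hit as one network (connected components). The function name `classify_genes`, the input layout and the `max_evalue` cutoff are hypothetical choices made for illustration only.

```python
from collections import defaultdict


def classify_genes(genes, hits, max_evalue=1e-5):
    """Split genes into paralog networks and singletons.

    genes: iterable of gene identifiers.
    hits: iterable of (query, subject, evalue) tuples from an
          all-against-all similarity search (assumed input format).
    max_evalue: hypothetical cutoff, not a value taken from the source.
    """
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for query, subject, evalue in hits:
        # Ignore self-hits, weak hits and identifiers outside the gene set.
        if query == subject or evalue > max_evalue:
            continue
        if query not in parent or subject not in parent:
            continue
        root_q, root_s = find(query), find(subject)
        if root_q != root_s:
            parent[root_s] = root_q

    groups = defaultdict(set)
    for g in parent:
        groups[find(g)].add(g)

    # Networks hold two or more genes; singletons have no accepted hit.
    networks = sorted(sorted(m) for m in groups.values() if len(m) > 1)
    singletons = sorted(next(iter(m)) for m in groups.values() if len(m) == 1)
    return networks, singletons


if __name__ == "__main__":
    example_genes = ["g1", "g2", "g3", "g4", "g5"]
    example_hits = [
        ("g1", "g2", 1e-30),
        ("g2", "g3", 1e-12),
        ("g4", "g4", 0.0),   # self-hit, ignored
        ("g4", "g5", 0.5),   # above the cutoff, ignored
    ]
    print(classify_genes(example_genes, example_hits))
```

In such a sketch the choice of cutoff directly changes which genes end up as singletons, which is one reason the methodological settings of a paralog search can strongly affect the outcome.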