Hierarchical orthologous groups are defined as sets of genes that have descended from a single common ancestor within a taxonomic range of interest. Identifying such groups is useful in a wide range of contexts, including inference of gene function, study of gene evolution dynamics and comparative genomics. Hierarchical orthologous groups can be derived from reconciled gene/species trees but, this being a computationally costly procedure, many phylogenomic databases work on the basis of pairwise gene comparisons instead (“graph-based” approach). To our knowledge, there is only one published algorithm for graph-based hierarchical group inference, but both its theoretical justification and performance in practice are as of yet largely uncharacterised. We establish a formal correspondence between the orthology graph and hierarchical orthologous groups. Based on that, we devise GETHOGs (“Graph-based Efficient Technique for Hierarchical Orthologous Groups”), a novel algorithm to infer hierarchical groups directly from the orthology graph, thus without needing gene tree inference nor gene/species tree reconciliation. GETHOGs is shown to correctly reconstruct hierarchical orthologous groups when applied to perfect input, and several extensions with stringency parameters are provided to deal with imperfect input data. We demonstrate its competitiveness using both simulated and empirical data. GETHOGs is implemented as a part of the freely-available OMA standalone package (http://omabrowser.org/standalone). Furthermore, hierarchical groups inferred by GETHOGs (“OMA HOGs”) on >1,000 genomes can be interactively queried via the OMA browser (http://omabrowser.org).
Kristensen DM, Wolf YI, Mushegian AR, Koonin EV (2011) Computational methods for Gene Orthology inference. Briefings in bioinformatics 12: 379–391.
[3]
Altenhoff AM, Dessimoz C (2012) Inferring orthology and paralogy. In: Anisimova M, editor, Evolutionary Genomics, Clifton, NJ, USA: Methods in Molecular Biology. 259–279.
[4]
Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N (1999) The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 96: 2896–2901.
Remm M, Storm C, Sonnhammer E (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 314: 1041–52.
[7]
Roth AC, Gonnet GH, Dessimoz C (2008) The algorithm of OMA for large-scale orthology inference. BMC Bioinformatics 9: 518.
[8]
Altenhoff AM, Dessimoz C (2009) Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol 5: e1000262.
[9]
Boeckmann B, Robinson-Rechavi M, Xenarios I, Dessimoz C (2011) Conceptual framework and pilot study to benchmark phylogenomic databases based on reference gene trees. Brief Bioinform 12: 474–484.
[10]
Li L, Stoeckert CJJ, Roos DS (2003) Orthomcl: identi_cation of ortholog groups for eukaryotic genomes. Genome Res 13: 2178–2189.
[11]
Chen C, Natale DA, Finn RD, Huang H, Zhang J, et al. (2011) Representative proteomes: a stable, scalable and unbiased proteome set for sequence analysis and functional annotation. PLoS ONE 6: e18910.
[12]
Miele V, Penel S, Duret L (2011) Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics 12: 116.
[13]
Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, et al. (2012) eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Research 40: D284–9.
[14]
Waterhouse RM, Zdobnov EM, Tegenfeldt F, Li J, Kriventseva EV (2011) OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011. Nucleic Acids Research 39: D283–8.
[15]
Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C (2011) OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res 39: D289–D294.
[16]
van der Heijden RT, Snel B, van Noort V, Huynen MA (2007) Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics 8: 83.
[17]
Vilella AJJ, Severin J, Ureta-Vidal A, Durbin R, Heng L, et al. (2009) Ensemblcompara genetrees: Analysis of complete, duplication aware phylogenetic trees in vertebrates. Genome research 19: 327–335.
[18]
Wapinski I, Pfeffer A, Friedman N, Regev A (2007) Natural history and evolutionary principles of gene duplication in fungi. Nature 449: 54–61.
[19]
Huerta-Cepas J, Bueno A, Dopazo J, Gabaldón T (2008) Phylomedb: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res 36: D491–D496.
[20]
Chen F, Mackey AJ, Stoeckert CJ, Roos DS (2006) Orthomcl-db: querying a comprehensive multispecies collection of ortholog groups. Nucleic Acids Res 34: D363–D368.
[21]
Ostlund G, Schmitt T, Forslund K, K?stler T, Messina DN, et al. (2010) Inparanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res 38: D196–D203.
[22]
DeLuca TF, Wu IH, Pu J, Monaghan T, Peshkin L, et al. (2006) Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioinformatics 22: 2044–2046.
[23]
Jothi R, Zotenko E, Tasneem A, Przytycka TM (2006) Coco-cl: hierarchical clustering of homology relations based on evolutionary correlations. Bioinformatics 22: 779–788.
[24]
Dalquen DA, Anisimova M, Gonnet GH, Dessimoz C (2012) ALF–A Simulation Framework for Genome Evolution. Molecular Biology and Evolution 29: 1115–1123.
[25]
Yang Z, Nielsen R, Goldman N, Pedersen AMK (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155: 432–449.
[26]
Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, et al. (2009) Database resources of the national center for biotechnology information. Nucleic acids research 37: D5–15.
[27]
Linard B, Thompson J, Poch O, Lecompte O (2011) Orthoinspector: comprehensive orthology analysis and visual exploration. BMC Bioinformatics 12: 11.
[28]
Trachana K, Larsson TA, Powell S, Chen WH, Doerks T, et al. (2011) Orthology prediction methods: a quality assessment using curated protein families. BioEssays 33: 769–780.
[29]
Dessimoz C, Boeckmann B, Roth A, Gonnet GH (2006) Detecting non-orthology in the cog database and other approaches grouping orthologs using genome-specific best hits. Nucleic Acids Res 34: 3309–3316.
[30]
Gonnet GH, Hallett MT, Korostensky C, Bernardin L (2000) Darwin v. 2.0: An interpreted computer language for the biosciences. Bioinformatics 16: 101–103.
[31]
Kristensen DM, Kannan L, Coleman MK, Wolf YI, Sorokin A, et al. (2010) A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics 26: 1481–1487.
[32]
Neyman J (1971) Molecular studies of evolution: a source of novel statistical problems. In: Gupta S, Yackel J, editors, Statistical decision theory and related topics, Academic Press, New York. 1–27.
[33]
Tarjan RE, van Leeuwen J (1984) Worst-case analysis of set union algorithms. Journal of the ACM 31: 245–281.
[34]
Ford L, Fulkerson D (1956) Maximal ow through a network. Canadian Journal of Mathematics 8: 399–404.
[35]
Karger DR, Stein C (1996) A new approach to the minimum cut problem. J ACM 43: 601–640.