A scalable method for identifying frequent subtrees in sets of large phylogenetic trees
Ramu Avinash, Kahveci Tamer, Burleigh J Gordon
BMC Bioinformatics , 2012, DOI: 10.1186/1471-2105-13-256
Abstract: Background: We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees. Results: We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods. Conclusions: Although this heuristic is not guaranteed to find all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.
Inferring phylogenies with incomplete data sets: a 5-gene, 567-taxon analysis of angiosperms
J Gordon Burleigh, Khidir W Hilu, Douglas E Soltis
BMC Evolutionary Biology , 2009, DOI: 10.1186/1471-2148-9-61
Abstract: We performed maximum likelihood bootstrap analyses of the complete, 3-gene 567-taxon data matrix and the incomplete, 5-gene 567-taxon data matrix. Although the 5-gene matrix has more missing data (27.5%) than the 3-gene data matrix (2.9%), the 5-gene analysis resulted in higher levels of bootstrap support. Within the 567-taxon tree, the increase in support is most evident for relationships among the 170 taxa for which both matK and 26S rDNA sequences were added, and there is little gain in support for relationships among the 119 taxa having neither matK nor 26S rDNA sequences. The 5-gene analysis also places the enigmatic Hydrostachys in Lamiales (BS = 97%) rather than in Cornales (BS = 100% in 3-gene analysis). The placement of Hydrostachys in Lamiales is unprecedented in molecular analyses, but it is consistent with embryological and morphological data. Adding available, and often incomplete, sets of sequences to existing data sets can be a fast and inexpensive way to increase support for phylogenetic relationships and produce novel and credible new phylogenetic hypotheses. Molecular data have had an enormous impact on angiosperm phylogenetic hypotheses (e.g. [1-5]), and the abundance of new sequence data provides the potential for further resolving angiosperm relationships. Still, molecular phylogenetic studies across all angiosperms have utilized only a small fraction of the available sequence data. While GenBank currently contains over 1.7 million core nucleotide sequences from angiosperms, with over 160,000 of these being from often phylogenetically useful plastid loci [6], few phylogenetic analyses of angiosperms have included more than a thousand sequences.
We examine whether augmenting existing plant data matrices with incomplete data assembled from publicly available sources can enhance the understanding of the backbone phylogenetic relationships across angiosperms. The sampling strategies of phylogenetic studies across angiosperms demonstrate a tradeoff betw
Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance
Ruchi Chaudhary, J. Gordon Burleigh, David Fernández-Baca
Computer Science , 2012
Abstract: We present a new method for inferring species trees from multi-copy gene trees. Our method is based on a generalization of the Robinson-Foulds (RF) distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple leaves can have the same label. Unlike most previous phylogenetic methods using gene trees, this method does not assume that gene tree incongruence is caused by a single, specific biological process, such as gene duplication and loss, deep coalescence, or lateral gene transfer. We prove that it is NP-hard to compute the RF distance between two mul-trees, but it is easy to calculate the generalized RF distance between a mul-tree and a singly-labeled tree. Motivated by this observation, we formulate the RF supertree problem for mul-trees (MulRF), which takes a collection of mul-trees and constructs a species tree that minimizes the total RF distance from the input mul-trees. We present a fast heuristic algorithm for the MulRF supertree problem. Simulation experiments demonstrate that the MulRF method produces more accurate species trees than gene tree parsimony methods when incongruence is caused by gene tree error, duplications and losses, and/or lateral gene transfer. Furthermore, the MulRF heuristic runs quickly on data sets containing hundreds of trees with up to a hundred taxa.
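For background on the distance that the abstract above generalizes, here is a minimal sketch of the standard RF distance between two singly-labeled rooted trees on the same leaf set. It is not the MulRF generalization to mul-trees and is not taken from the paper. The nested-tuple tree encoding and the function names `clusters` and `rf_distance` are assumptions made for illustration, and the value returned is the unnormalized count of clusters found in exactly one of the two trees.

```python
def clusters(tree):
    """Return the non-trivial clusters of a rooted tree as frozensets of leaf labels.

    Assumed encoding (illustrative only): a leaf is a str, an internal node
    is a tuple of subtrees.
    """
    found = set()

    def leaves_below(node):
        # A leaf contributes only its own label and defines no cluster.
        if isinstance(node, str):
            return frozenset([node])
        # An internal node's cluster is the union of its children's leaf sets.
        leaf_set = frozenset().union(*(leaves_below(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    all_leaves = leaves_below(tree)
    # The root cluster is shared by every tree on this leaf set, so drop it.
    found.discard(all_leaves)
    return found


def rf_distance(tree_a, tree_b):
    """Count the clusters present in exactly one of the two trees."""
    return len(clusters(tree_a) ^ clusters(tree_b))


if __name__ == "__main__":
    t1 = ((("a", "b"), "c"), "d")
    t2 = ((("a", "c"), "b"), "d")
    print(rf_distance(t1, t2))
```

Some authors divide this count by two or normalize it by the maximum possible value; the sketch leaves it unscaled.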
Constructing and Employing Tree Alignment Graphs for Phylogenetic Synthesis
Ruchi Chaudhary, David Fernandez-Baca, J. Gordon Burleigh
Computer Science , 2015
Abstract: Tree alignment graphs (TAGs) provide an intuitive data structure for storing phylogenetic trees that exhibits the relationships of the individual input trees and can potentially account for nested taxonomic relationships. This paper provides a theoretical foundation for the use of TAGs in phylogenetics. We provide a formal definition of TAG that - unlike previous definitions - does not depend on the order in which input trees are provided. In the consensus case, when all input trees have the same leaf labels, we describe algorithms for constructing majority-rule and strict consensus trees using the TAG. When the input trees do not have identical sets of leaf labels, we describe how to determine if the input trees are compatible and, if they are compatible, to construct a supertree that contains the input trees.
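To illustrate the consensus case mentioned in the abstract above, the following sketch computes the clusters of the strict and majority-rule consensus by classical cluster counting. It is not the TAG-based construction the paper describes. The nested-tuple tree encoding and the names `leaf_clusters`, `strict_clusters` and `majority_rule_clusters` are hypothetical, and all input trees are assumed to share the same leaf labels.

```python
from collections import Counter


def leaf_clusters(tree):
    """Return the non-trivial clusters of a rooted tree as frozensets of leaf labels.

    Assumed encoding (illustrative only): a leaf is a str, an internal node
    is a tuple of subtrees.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        leaf_set = frozenset().union(*(leaves_below(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    found.discard(leaves_below(tree))  # drop the uninformative root cluster
    return found


def cluster_counts(trees):
    """Count in how many input trees each cluster occurs."""
    return Counter(c for tree in trees for c in leaf_clusters(tree))


def strict_clusters(trees):
    """Clusters present in every input tree."""
    return {c for c, k in cluster_counts(trees).items() if k == len(trees)}


def majority_rule_clusters(trees):
    """Clusters present in more than half of the input trees."""
    return {c for c, k in cluster_counts(trees).items() if 2 * k > len(trees)}


if __name__ == "__main__":
    t1 = ((("a", "b"), "c"), "d")
    t2 = ((("a", "c"), "b"), "d")
    print(strict_clusters([t1, t1, t2]))
    print(majority_rule_clusters([t1, t1, t2]))
```

Clusters that occur in more than half of the trees are always pairwise compatible, so the sets returned here can be assembled into a single consensus tree; that assembly step is omitted from the sketch.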
Exploring Diversification and Genome Size Evolution in Extant Gymnosperms through Phylogenetic Synthesis
J. Gordon Burleigh, W. Brad Barbazuk, John M. Davis, Alison M. Morse, Pamela S. Soltis
Journal of Botany , 2012, DOI: 10.1155/2012/292857
Abstract: Gymnosperms, comprising cycads, Ginkgo, Gnetales, and conifers, represent one of the major groups of extant seed plants. Yet compared to angiosperms, little is known about the patterns of diversification and genome evolution in gymnosperms. We assembled a phylogenetic supermatrix containing over 4.5 million nucleotides from 739 gymnosperm taxa. Although 93.6% of the cells in the supermatrix are empty, the data reveal many strongly supported nodes that are generally consistent with previous phylogenetic analyses, including weak support for Gnetales sister to Pinaceae. A lineage through time plot suggests elevated rates of diversification within the last 100 million years, and there is evidence of shifts in diversification rates in several clades within cycads and conifers. A likelihood-based analysis of the evolution of genome size in 165 gymnosperms finds evidence for heterogeneous rates of genome size evolution due to an elevated rate in Pinus. 1. Introduction Recent advances in sequencing technology offer the possibility of identifying the genetic mechanisms that influence evolutionarily important characters and ultimately drive diversification. Within angiosperms, large-scale phylogenetic analyses have identified complex patterns of diversification (e.g., [1–3]), and numerous genomes are at least partially sequenced. Yet the other major clade of seed plants, the gymnosperms, have received far less attention, with few comprehensive studies of diversification and no sequenced genomes. Note that throughout this paper “gymnosperms” specifies only the approximately 1000 extant species within cycads, Ginkgo, Gnetales, and conifers. These comprise the Acrogymnospermae clade described by Cantino et al. [4]. Many gymnosperms have exceptionally large genomes (e.g., [5–7]), and this has hindered whole-genome sequencing projects, especially among economically important Pinus species. 
This large genome size is interesting because one suggested mechanism for rapid increases in genome size, polyploidy, is rare among gymnosperms [8]. Recent sequencing efforts have elucidated some of the genomic characteristics associated with the large genome size in Pinus. Morse et al. [9] identified a large retrotransposon family in Pinus that, with other retrotransposon families, accounts for much of the genomic complexity. Similarly, recent sequencing of 10 BAC (bacterial artificial chromosome) clones from Pinus taeda identified many conifer-specific LTR (long terminal repeat) retroelements [10]. These studies suggest that the large genome size may be caused by rapid expansion of
Robinson-Foulds Supertrees
Mukul S Bansal, J Gordon Burleigh, Oliver Eulenstein, David Fernández-Baca
Algorithms for Molecular Biology , 2010, DOI: 10.1186/1748-7188-5-18
Abstract: We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of Θ(n) and Θ(n²) respectively (where n is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods. Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees. Supertree methods provide a formal approach for combining small phylogenetic trees with incomplete species overlap in order to build comprehensive species phylogenies, or supertrees, that contain all species found in the input trees. Supertree analyses have produced the first family-level phylogeny of flowering plants [1] and the first phylogeny of nearly all extant mammal species [2]. They have also enabled phylogenetic analyses using large-scale genomic data sets in bacteria, across eukaryotes, and within plants [3,4] and have helped elucidate the origin of eukaryotic genomes [5].
Furthermore, supertrees have been used to examine rates and patterns of species diversification [1,2], to test hy
PhyloFinder: An intelligent search engine for phylogenetic tree databases
Duhong Chen, J Gordon Burleigh, Mukul S Bansal, David Fernández-Baca
BMC Evolutionary Biology , 2008, DOI: 10.1186/1471-2148-8-90
Abstract: PhyloFinder is an intelligent search engine for phylogenetic databases that we have implemented using trees from TreeBASE. It enables taxonomic queries, in which it identifies trees in the database containing the exact name of the query taxon and/or any synonymous taxon names, and it provides spelling suggestions for the query when there is no match. Additionally, PhyloFinder can identify trees containing descendants or direct ancestors of the query taxon. PhyloFinder also performs phylogenetic queries, in which it identifies trees that contain the query tree or topologies that are similar to the query tree. PhyloFinder can enhance the utility of any tree database by providing tools for both taxonomic and phylogenetic queries as well as visualization tools that highlight the query results and provide links to NCBI and TBMap. An implementation of PhyloFinder using trees from TreeBASE is available from the web client application found in the availability and requirements section. The rapidly expanding wealth of phylogenetic information from across the tree of life offers unprecedented opportunities for large-scale evolutionary studies and for examining an array of biological questions in a phylogenetic context [1]. However, much of the published phylogenetic data is not easily accessible. Therefore, the storage and efficient retrieval of phylogenetic data are important challenges for bioinformatics [1-5]. TreeBASE is the largest relational database of published phylogenetic information. It stores more than 4,400 trees that contain over 75,000 taxa, the data matrices used to infer the trees, and additional meta-data, such as bibliographic information and details of the phylogenetic analyses [6,7]. Though TreeBASE is a valuable repository for phylogenetic data, it is often difficult to identify and access relevant phylogenetic data from within TreeBASE.
In this paper, we present PhyloFinder, a new phylogenetic tree search engine that greatly expands upon the current searc
Improved Heuristics for Minimum-Flip Supertree Construction
Duhong Chen, Oliver Eulenstein, David Fernández-Baca, J. Gordon Burleigh
Evolutionary Bioinformatics , 2006
Abstract: The utility of the matrix representation with flipping (MRF) supertree method has been limited by the speed of its heuristic algorithms. We describe a new heuristic algorithm for MRF supertree construction that improves upon the speed of the previous heuristic by a factor of n (the number of taxa in the supertree). This new heuristic makes MRF tractable for large-scale supertree analyses and allows the first comparisons of MRF with other supertree methods using large empirical data sets. Analyses of three published supertree data sets with between 267 and 571 taxa indicate that MRF supertrees are equally or more similar to the input trees on average than matrix representation with parsimony (MRP) and modified mincut supertrees. The results also show that large differences may exist between MRF and MRP supertrees and demonstrate that the MRF supertree method is a practical and potentially more accurate alternative to the nearly ubiquitous MRP supertree method.
The DODO Survey II: A Gemini Direct Imaging Search for Substellar and Planetary Mass Companions around Nearby Equatorial and Northern Hemisphere White Dwarfs
E. Hogan, M. R. Burleigh, F. J. Clarke
Physics , 2009, DOI: 10.1111/j.1365-2966.2009.14565.x
Abstract: The aim of the Degenerate Objects around Degenerate Objects (DODO) survey is to search for very low mass brown dwarfs and extrasolar planets in wide orbits around white dwarfs via direct imaging. The direct detection of such companions would allow the spectroscopic investigation of objects with temperatures much lower (< 500 K) than the coolest brown dwarfs currently observed. These ultra-low mass substellar objects would have spectral types > T8.5 and so could belong to the proposed Y dwarf spectral sequence. The detection of a planet around a white dwarf would prove that such objects can survive the final stages of stellar evolution and place constraints on the frequency of planetary systems around their progenitors (with masses between 1.5 - 8 solar masses, i.e., early B to mid F). This paper presents the results of a multi-epoch J band common proper motion survey of 23 nearby equatorial and northern hemisphere white dwarfs. We rule out the presence of any common proper motion companions, with limiting masses determined from the completeness limit of each observation, to 18 white dwarfs. For the remaining five targets, the motion of the white dwarf is not sufficiently separated from the non-moving background objects in each field. These targets require additional observations to conclusively rule out the presence of any common proper motion companions. From our completeness limits, we tentatively suggest that < 5% of white dwarfs have substellar companions with effective temperatures > 500 K between projected physical separations of 60 - 200 AU.
Latest Results from the DODO Survey: Imaging Planets around White Dwarfs
E. Hogan, M. R. Burleigh, F. J. Clarke
Physics , 2011, DOI: 10.1063/1.3570984
Abstract: The aim of the Degenerate Objects around Degenerate Objects (DODO) survey is to search for very low mass brown dwarfs and extrasolar planets in wide orbits around white dwarfs via direct imaging. The direct detection of such companions would allow the spectroscopic investigation of objects with temperatures lower (< 500 K) than the coolest brown dwarfs currently observed. The discovery of planets around white dwarfs would prove that such objects can survive the final stages of stellar evolution and place constraints on the frequency of planetary systems around their progenitors (with masses between 1.5 - 8 M*, i.e., early B to mid-F). An increasing number of planetary mass companions have been directly imaged in wide orbits around young main sequence stars. For example, the planets around HR 8799 and 1RXS J160929.1 - 210524 are in wide orbits of 24 - 68 AU and 330 AU, respectively. The DODO survey has the ability to directly image planets in post-main sequence analogues of these systems. These proceedings present the latest results of our multi-epoch J band common proper motion survey of nearby white dwarfs.
