Abstract:
The Robinson-Foulds (RF) metric is arguably the most widely used measure of phylogenetic tree similarity, despite its well-known shortcomings: For example, moving a single taxon in a tree can result in a tree that has maximum distance to the original one; but the two trees are identical if we remove the single taxon. To this end, we propose a natural extension of the RF metric that does not simply count identical clades but instead, also takes similar clades into consideration. In contrast to previous approaches, our model requires the matching between clades to respect the structure of the two trees, a property that the classical RF metric exhibits, too. We show that computing this generalized RF metric is, unfortunately, NP-hard. We then present a simple Integer Linear Program for its computation, and evaluate it by an all-against-all comparison of 100 trees from a benchmark data set. We find that matchings that respect the tree structure differ significantly from those that do not, underlining the importance of this natural condition.

Abstract:
We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (na？ve) solutions by a factor of Θ(n) and Θ(n2) respectively (where n is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods.Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees.Supertree methods provide a formal approach for combining small phylogenetic trees with incomplete species overlap in order to build comprehensive species phylogenies, or supertrees, that contain all species found in the input trees. Supertree analyses have produced the first family-level phylogeny of flowering plants [1] and the first phylogeny of nearly all extant mammal species [2]. They have also enabled phylogenetic analyses using large-scale genomic data sets in bacteria, across eukaryotes, and within plants [3,4] and have helped elucidate the origin of eukaryotic genomes [5]. Furthermore, supertrees have been used to examine rates and patterns of species diversification [1,2], to test hy

Abstract:
molecular trees of trypanosomes have confirmed conventionally accepted genera, but often produce topologies that are incongruent with knowledge of the evolution, systematics, and biogeography of hosts and vectors. these distorted topologies result largely from incorrect assumptions about molecular clocks. a host-based phylogenetic tree could serve as a broad outline against which the reasonability of molecular phylogenies could be evaluated. the host-based tree of trypanosomes presented here supports the " invertebrate first " hypothesis of trypansosome evolution, supports the monophyly of trypanosomatidae, and indicates the digenetic lifestyle arose three times. an area cladogram of leishmania supports origination in the palaearctic during the palaeocene.

Abstract:
Molecular trees of trypanosomes have confirmed conventionally accepted genera, but often produce topologies that are incongruent with knowledge of the evolution, systematics, and biogeography of hosts and vectors. These distorted topologies result largely from incorrect assumptions about molecular clocks. A host-based phylogenetic tree could serve as a broad outline against which the reasonability of molecular phylogenies could be evaluated. The host-based tree of trypanosomes presented here supports the " invertebrate first " hypothesis of trypansosome evolution, supports the monophyly of Trypanosomatidae, and indicates the digenetic lifestyle arose three times. An area cladogram of Leishmania supports origination in the Palaearctic during the Palaeocene.

Abstract:
When a phylogenetic reconstruction does not result in one tree but in several, tree metrics permit finding out how far the reconstructed trees are from one another. They also permit to assess the accuracy of a reconstruction if a true tree is known. TreeCmp implements eight metrics that can be calculated in polynomial time for arbitrary (not only bifurcating) trees: four for unrooted (Matching Split metric, which we have recently proposed, Robinson-Foulds, Path Difference, Quartet) and four for rooted trees (Matching Cluster, Robinson-Foulds cluster, Nodal Splitted and Triple). TreeCmp is the first implementation of Matching Split/Cluster metrics and the first efficient and convenient implementation of Nodal Splitted. It allows to compare relatively large trees. We provide an example of the application of TreeCmp to compare the accuracy of ten approaches to phylogenetic reconstruction with trees up to 5000 external nodes, using a measure of accuracy based on normalized similarity between trees.

Abstract:
In this paper we introduce and study three new measures for efficient discriminative comparison of phylogenetic trees. The NNI navigation dissimilarity $d_{nav}$ counts the steps along a "combing" of the Nearest Neighbor Interchange (NNI) graph of binary hierarchies, providing an efficient approximation to the (NP-hard) NNI distance in terms of "edit length". At the same time, a closed form formula for $d_{nav}$ presents it as a weighted count of pairwise incompatibilities between clusters, lending it the character of an edge dissimilarity measure as well. A relaxation of this formula to a simple count yields another measure on all trees --- the crossing dissimilarity $d_{CM}$. Both dissimilarities are symmetric and positive definite (vanish only between identical trees) on binary hierarchies but they fail to satisfy the triangle inequality. Nevertheless, both are bounded below by the widely used Robinson-Foulds metric and bounded above by a closely related true metric, the cluster-cardinality metric $d_{CC}$. We show that each of the three proposed new dissimilarities is computable in time $O(n^2)$ in the number of leaves $n$, and conclude the paper with a brief numerical exploration of the distribution over tree space of these dissimilarities in comparison with the Robinson-Foulds metric and the more recently introduced matching-split distance.

Abstract:
A central theme in phylogenetics is the reconstruction and analysis of evolutionary trees from a given set of data. To determine the optimal search methods for reconstructing trees, it is crucial to understand the size and structure of the neighbourhoods of trees under tree rearrangement operations. The diameter and size of the immediate neighbourhood of a tree has been well-studied, however little is known about the number of trees at distance two, three or (more generally) $k$ from a given tree. In this paper we provide a number of exact and asymptotic results concerning these quantities, and identify some key aspects of tree shape that play a role in determining these quantities. We obtain several new results for two of the main tree rearrangement operations - Nearest Neighbour Interchange and Subtree Prune and Regraft -- as well as for the Robinson-Foulds metric on trees.

Abstract:
We propose model trees as a method to identify gene interaction networks. While correlation-based methods analyze each pair of genes, in our approach we generate a single regression tree for each gene from the remaining genes. Finally, a graph from all the relationships among output and input genes is built taking into account whether the pair of genes is statistically significant. For this reason we apply a statistical procedure to control the false discovery rate. The performance of our approach, named REGNET, is experimentally tested on two well-known data sets: Saccharomyces Cerevisiae and E.coli data set. First, the biological coherence of the results are tested. Second the E.coli transcriptional network (in the Regulon database) is used as control to compare the results to that of a correlation-based method. This experiment shows that REGNET performs more accurately at detecting true gene associations than the Pearson and Spearman zeroth and first-order correlation-based methods.REGNET generates gene association networks from gene expression data, and differs from correlation-based methods in that the relationship between one gene and others is calculated simultaneously. Model trees are very useful techniques to estimate the numerical values for the target genes by linear regression functions. They are very often more precise than linear regression models because they can add just different linear regressions to separate areas of the search space favoring to infer localized similarities over a more global similarity. Furthermore, experimental results show the good performance of REGNET.In the area of microarray data analysis, inferring gene-gene interactions involved in biological function is a relevant task. Over the past few years several statistical and machine learning techniques have been proposed to carry out the inferring task of gene-gene interactions or gene regulatory networks. Clustering algorithm represents one of the first approaches to support th

Abstract:
In 1982, Penny, Foulds and Hendy [1] made a test of the theory of evolution by comparing phylogenetic trees constructed from different protein-coding genes from the same set of species. Specifically, they tested whether a unique evolutionary tree relating these genes existed or not, and whether it could be recovered. The existence of such a tree was important, not only to confirm the theory of evolution, but also to show that this theory allowed quantitative and falsifiable predictions. At that time, five proteins from 11 mammalian species were available for the study, but each protein produced different trees. At first sight, this contradicted the existence of a unique tree. However, the authors, being aware of the methodological difficulties in phlyogenetic reconstruction, did not expect the five trees to be identical. Rather, they expected them to be similar. To measure topological dissimilarity between trees they made use of the symmetric difference distance (also know as the Robinson-Foulds distance), which had just been introduced [2], and found that the trees obtained from the five genes were indeed more similar than expected by chance, proving the existence of a unique tree relating these sequences. This was a simple but powerful study that opened the way to test evolutionary hypotheses by means of multi-gene studies or 'phylogenomics'.Now, 25 years later, a much larger multi-gene study has been published by Huerta-Cepas et al. [3], using the complete human proteome and the homologous genes in the other complete genome sequences now available. This time, 21,588 trees, each including a different human gene, were obtained from the genomes of 39 species of eukaryotes. What can we learn from such a large number of phylogenies?Nowadays, we take it as read that eukaryotes are related by a unique phylogenetic tree. Instead, the focus of recent phylogenetic work has shifted to studying whether we can determine the exact branching pattern of this tree. The results ha

Abstract:
Consensus methods provide a useful strategy for combining information from a collection of gene trees. An important application of consensus methods is to combine gene trees to estimate a species tree. To investigate the theoretical properties of consensus trees that would be obtained from large numbers of loci evolving according to a basic evolutionary model, we construct consensus trees from independent gene trees that occur in proportion to gene tree probabilities derived from coalescent theory. We consider majority-rule, rooted triple (R*), and greedy consensus trees constructed from known gene trees, both in the asymptotic case as numbers of gene trees approach infinity and for finite numbers of genes. Our results show that for some combinations of species tree branch lengths, increasing the number of independent loci can make the majority-rule consensus tree more likely to be at least partially unresolved and the greedy consensus tree less likely to match the species tree. However, the probability that the R* consensus tree has the species tree topology approaches 1 as the number of gene trees approaches infinity. Although the greedy consensus algorithm can be the quickest to converge on the correct species tree when increasing the number of gene trees, it can also be positively misleading. The majority-rule consensus tree is not a misleading estimator of the species tree topology, and the R* consensus tree is a statistically consistent estimator of the species tree topology. Our results therefore suggest a method for using multiple loci to infer the species tree topology, even when it is discordant with the most likely gene tree.