Abstract:
Lam, Gusfield, and Sridhar (2009) showed that a set of three-state characters has a perfect phylogeny if and only if every subset of three characters has a perfect phylogeny. They also gave a complete characterization of the sets of three three-state characters that do not have a perfect phylogeny. However, it is not clear from their characterization how to find a subset of three characters that does not have a perfect phylogeny without testing all triples of characters. In this note, we build upon their result by giving a simple characterization of when a set of three-state characters does not have a perfect phylogeny that can be inferred from testing all pairs of characters.

Abstract:
Deciding whether a collection of unrooted trees is compatible is a fundamental problem in phylogenetics. Two different graph-theoretic characterizations of tree compatibility have recently been proposed. In one of these, tree compatibility is characterized in terms of the existence of a specific kind of triangulation in a structure known as the display graph. An alternative characterization expresses the tree compatibility problem as a chordal graph sandwich problem in a structure known as the edge label intersection graph. In this paper we show that the characterization using edge label intersection graphs transforms to a characterization in terms of minimal cuts of the display graph. We show how these two characterizations are related to compatibility of splits. We also show how the characterization in terms of minimal cuts of display graph is related to the characterization in terms of triangulation of the display graph.

Abstract:
Deciding whether there is a single tree -a supertree- that summarizes the evolutionary information in a collection of unrooted trees is a fundamental problem in phylogenetics. We consider two versions of this question: agreement and compatibility. In the first, the supertree is required to reflect precisely the relationships among the species exhibited by the input trees. In the second, the supertree can be more refined than the input trees. Tree compatibility can be characterized in terms of the existence of a specific kind of triangulation in a structure known as the display graph. Alternatively, it can be characterized as a chordal graph sandwich problem in a structure known as the edge label intersection graph. Here, we show that the latter characterization yields a natural characterization of compatibility in terms of minimal cuts in the display graph, which is closely related to compatibility of splits. We then derive a characterization for agreement.

Abstract:
We consider the following basic problem in phylogenetic tree construction. Let $\mathcal{P} = \{T_1, \ldots, T_k\}$ be a collection of rooted phylogenetic trees over various subsets of a set of species. The tree compatibility problem asks whether there is a tree $T$ with the following property: for each $i \in \{1, \dots, k\}$, $T_i$ can be obtained from the restriction of $T$ to the species set of $T_i$ by contracting zero or more edges. If such a tree $T$ exists, we say that $\mathcal{P}$ is compatible. We give a $\tilde{O}(M_\mathcal{P})$ algorithm for the tree compatibility problem, where $M_\mathcal{P}$ is the total number of nodes and edges in $\mathcal{P}$. Unlike previous algorithms for this problem, the running time of our method does not depend on the degrees of the nodes in the input trees. Thus, it is equally fast on highly resolved and highly unresolved trees.

Abstract:
We characterize the compatibility of a collection of unrooted phylogenetic trees as a question of determining whether a graph derived from these trees --- the display graph --- has a specific kind of triangulation, which we call legal. Our result is a counterpart to the well known triangulation-based characterization of the compatibility of undirected multi-state characters.

Abstract:
We study a variant of one of Cotton and Wilkinson's methods, called majority-rule (+) supertrees. After proving that a key underlying problem for constructing majority-rule (+) supertrees is NP-hard, we develop a polynomial-size exact integer linear programming formulation of the problem. We then present a data reduction heuristic that identifies smaller subproblems that can be solved independently. While this technique is not guaranteed to produce optimal solutions, it can achieve substantial problem-size reduction. Finally, we report on a computational study of our approach on various real data sets, including the 121-taxon, 7-tree Seabirds data set of Kennedy and Page.The results indicate that our exact method is computationally feasible for moderately large inputs. For larger inputs, our data reduction heuristic makes it feasible to tackle problems that are well beyond the range of the basic integer programming approach. Comparisons between the results obtained by our heuristic and exact solutions indicate that the heuristic produces good answers. Our results also suggest that the majority-rule (+) approach, in both its basic form and with data reduction, yields biologically meaningful phylogenies.A supertree method begins with a collection of phylogenetic trees with possibly different leaf (taxon) sets, and assembles them into a larger phylogenetic tree, a supertree, whose taxon set is the union of the taxon sets of the input trees. Interest in supertrees was sparked by Gordon's paper [1]. Since then, particularly during the past decade, there has been a flurry of activity with many supertree methods proposed and studied from the algorithmic, theoretical, and biological points of view. The appeal of supertree synthesis is that it can combine disparate data to provide a high-level perspective that is harder to attain from individual trees. A recent example of the use of this approach is the species-level phylogeny of nearly all extant Mammalia constructed by Bin

Abstract:
We define, analyze, and give efficient algorithms for two kinds of distance measures for rooted and unrooted phylogenies. For rooted trees, our measures are based on the topologies the input trees induce on triplets; that is, on three-element subsets of the set of species. For unrooted trees, the measures are based on quartets (four-element subsets). Triplet and quartet-based distances provide a robust and fine-grained measure of the similarities between trees. The distinguishing feature of our distance measures relative to traditional quartet and triplet distances is their ability to deal cleanly with the presence of unresolved nodes, also called polytomies. For rooted trees, these are nodes with more than two children; for unrooted trees, they are nodes of degree greater than three. Our first class of measures are parametric distances, where there is a parameter that weighs the difference between an unresolved triplet/quartet topology and a resolved one. Our second class of measures are based on Hausdorff distance. Each tree is viewed as a set of all possible ways in which the tree could be refined to eliminate unresolved nodes. The distance between the original (unresolved) trees is then taken to be the Hausdorff distance between the associated sets of fully resolved trees, where the distance between trees in the sets is the triplet or quartet distance, as appropriate.

Abstract:
We present a new method for inferring species trees from multi-copy gene trees. Our method is based on a generalization of the Robinson-Foulds (RF) distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple leaves can have the same label. Unlike most previous phylogenetic methods using gene trees, this method does not assume that gene tree incongruence is caused by a single, specific biological process, such as gene duplication and loss, deep coalescence, or lateral gene transfer. We prove that it is NP-hard to compute the RF distance between two mul-trees, but it is easy to calculate the generalized RF distance between a mul-tree and a singly-labeled tree. Motivated by this observation, we formulate the RF supertree problem for mul-trees (MulRF), which takes a collection of mul-trees and constructs a species tree that minimizes the total RF distance from the input mul-trees. We present a fast heuristic algorithm for the MulRF supertree problem. Simulation experiments demonstrate that the MulRF method produces more accurate species trees than gene tree parsimony methods when incongruence is caused by gene tree error, duplications and losses, and/or lateral gene transfer. Furthermore, the MulRF heuristic runs quickly on data sets containing hundreds of trees with up to a hundred taxa.

Abstract:
A multi-labeled tree, or MUL-tree, is a phylogenetic tree where two or more leaves share a label, e.g., a species name. A MUL-tree can imply multiple conflicting phylogenetic relationships for the same set of taxa, but can also contain conflict-free information that is of interest and yet is not obvious. We define the information content of a MUL-tree T as the set of all conflict-free quartet topologies implied by T, and define the maximal reduced form of T as the smallest tree that can be obtained from T by pruning leaves and contracting edges while retaining the same information content. We show that any two MUL-trees with the same information content exhibit the same reduced form. This introduces an equivalence relation in MUL-trees with potential applications to comparing MUL-trees. We present an efficient algorithm to reduce a MUL-tree to its maximally reduced form and evaluate its performance on empirical datasets in terms of both quality of the reduced tree and the degree of data reduction achieved.

Abstract:
We study a long standing conjecture on the necessary and sufficient conditions for the compatibility of multi-state characters: There exists a function $f(r)$ such that, for any set $C$ of $r$-state characters, $C$ is compatible if and only if every subset of $f(r)$ characters of $C$ is compatible. We show that for every $r \ge 2$, there exists an incompatible set $C$ of $\lfloor\frac{r}{2}\rfloor\cdot\lceil\frac{r}{2}\rceil + 1$ $r$-state characters such that every proper subset of $C$ is compatible. Thus, $f(r) \ge \lfloor\frac{r}{2}\rfloor\cdot\lceil\frac{r}{2}\rceil + 1$ for every $r \ge 2$. This improves the previous lower bound of $f(r) \ge r$ given by Meacham (1983), and generalizes the construction showing that $f(4) \ge 5$ given by Habib and To (2011). We prove our result via a result on quartet compatibility that may be of independent interest: For every integer $n \ge 4$, there exists an incompatible set $Q$ of $\lfloor\frac{n-2}{2}\rfloor\cdot\lceil\frac{n-2}{2}\rceil + 1$ quartets over $n$ labels such that every proper subset of $Q$ is compatible. We contrast this with a result on the compatibility of triplets: For every $n \ge 3$, if $R$ is an incompatible set of more than $n-1$ triplets over $n$ labels, then some proper subset of $R$ is incompatible. We show this upper bound is tight by exhibiting, for every $n \ge 3$, a set of $n-1$ triplets over $n$ taxa such that $R$ is incompatible, but every proper subset of $R$ is compatible.