Abstract:
The Horton and Tokunaga branching laws provide a convenient framework for studying self-similarity in random trees. The Horton self-similarity is a weaker property that addresses the principal branching in a tree; it is a counterpart of the power-law size distribution for elements of a branching system. The stronger Tokunaga self-similarity addresses so-called side branching. The Horton and Tokunaga self-similarity have been empirically established in numerous observed and modeled systems, and proven for two paradigmatic models: the critical Galton-Watson branching process with finite progeny and the finite-tree representation of a regular Brownian excursion. This study establishes the Tokunaga and Horton self-similarity for a tree representation of a finite symmetric homogeneous Markov chain. We also extend the concept of Horton and Tokunaga self-similarity to infinite trees and establish self-similarity for an infinite-tree representation of a regular Brownian motion. We conjecture that fractional Brownian motions are also Tokunaga and Horton self-similar, with self-similarity parameters depending on the Hurst exponent.

Abstract:
Asymptotic analysis on some statistical properties of the random binary-tree model is developed. We quantify a hierarchical structure of branching patterns based on the Horton-Strahler analysis. We introduce a transformation of a binary tree, and derive a recursive equation about branch orders. As an application of the analysis, topological self-similarity and its generalization is proved in an asymptotic sense. Also, some important examples are presented.

Abstract:
Self-similarity of random trees is related to the operation of pruning. Pruning $R$ cuts the leaves and their parental edges and removes the resulting chains of degree-two nodes from a finite tree. A Horton-Strahler order of a vertex $v$ and its parental edge is defined as the minimal number of prunings necessary to eliminate the subtree rooted at $v$. A branch is a group of neighboring vertices and edges of the same order. The Horton numbers $N_k[K]$ and $N_{ij}[K]$ are defined as the expected number of branches of order $k$, and the expected number of order-$i$ branches that merged order-$j$ branches, $j>i$, respectively, in a finite tree of order $K$. The Tokunaga coefficients are defined as $T_{ij}[K]=N_{ij}[K]/N_j[K]$. The pruning decreases the orders of tree vertices by unity. A rooted full binary tree is said to be mean-self-similar if its Tokunaga coefficients are invariant with respect to pruning: $T_k:=T_{i,i+k}[K]$. We show that for self-similar trees, the condition $\limsup(T_k)^{1/k}<\infty$ is necessary and sufficient for the existence of the strong Horton law: $N_k[K]/N_1[K] \rightarrow R^{1-k}$, as $K \rightarrow \infty$ for some $R>0$ and every $k\geq 1$. This work is a step toward providing rigorous foundations for the Horton law that, being omnipresent in natural branching systems, has escaped so far a formal explanation.

Abstract:
We use the random self-similarity of the continuum random tree to show that it is homeomorphic to a post-critically finite self-similar fractal equipped with a random self-similar metric. As an application we determine the mean and almost-sure leading order behaviour of the high frequency asymptotics of the eigenvalue counting function associated with the natural Dirichlet form on the continuum random tree. We also obtain short time asymptotics for the trace of the heat semigroup and the annealed on-diagonal heat kernel associated with this Dirichlet form.

Abstract:
A well-established model for the genealogy of a large population in equilibrium is Kingman's coalescent. For the population together with its genealogy evolving in time, this gives rise to a time-stationary tree-valued process. We study the sum of the branch lengths, briefly denoted as tree length, and prove that the (suitably compensated) sequence of tree length processes converges, as the population size tends to infinity, to a limit process with cadlag paths, infinite infinitesimal variance, and a Gumbel distribution as its equilibrium.

Abstract:
The coalescent with recombination is a fundamental model to describe the genealogical history of DNA sequence samples from recombining organisms. Considering recombination as a process which acts along genomes and which creates sequence segments with shared ancestry, we study the influence of single recombination events upon tree characteristics of the coalescent. We focus on properties such as tree height and tree balance and quantify analytically the changes in these quantities incurred by recombination in terms of probability distributions. We find that changes in tree topology are often relatively mild under conditions of neutral evolution, while changes in tree height are on average quite large. Our results add to a quantitative understanding of the spatial coalescent and provide the neutral reference to which the impact by other evolutionary scenarios, for instance tree distortion by selective sweeps, can be compared.

Abstract:
Clustering is a process of discovering group of objects such that the objects of the same group are similar, and objects belonging to different groups are dissimilar. A number of clustering algorithms exist that can solve the problem of clustering, but most of them are very sensitive to their input parameters. Minimum Spanning Tree clustering algorithm is capable of detecting clusters with irregular boundaries. Detecting outlier in database (as unusual objects) is a big desire. In data mining detection of anomalous pattern in data is more interesting than detecting inliers. In this paper we propose a Minimum Spanning Tree based clustering algorithm for noise-free or pure clusters. The algorithm constructs hierarchy from top to bottom. At each hierarchical level, it optimizes the number of cluster, from which the proper hierarchical structure of underlying data set can be found. The algorithm uses a new cluster validation criterion based on the geometric property of data partition of the data set in order to find the proper number of clusters at each level. The algorithm works in two phases. The first phase of the algorithm create clusters with guaranteed intra-cluster similarity, where as the second phase of the algorithm create dendrogram using the clusters as objects with guaranteed inter-cluster similarity. The first phase of the algorithm uses divisive approach, where as the second phase uses agglomerative approach. In this paper we used both the approaches in the algorithm to find Best number of Meta similarity clusters.

Abstract:
One approach to estimating a species tree from a collection of gene trees is to first estimate probabilities of clades from the gene trees, and then to construct the species tree from the estimated clade probabilities. While a greedy consensus algorithm, which consecutively accepts the most probable clades compatible with previously accepted clades, can be used for this second stage, this method is known to be statistically inconsistent under the multispecies coalescent model. This raises the question of whether it is theoretically possible to reconstruct the species tree from known probabilities of clades on gene trees. We investigate clade probabilities arising from the multispecies coalescent model, with an eye toward identifying features of the species tree. Clades on gene trees with probability greater than 1/3 are shown to reflect clades on the species tree, while those with smaller probabilities may not. Linear invariants of clade probabilities are studied both computationally and theoretically, with certain linear invariants giving insight into the clade structure of the species tree. For species trees with generic edge lengths, these invariants can be used to identify the species tree topology. These theoretical results both confirm that clade probabilities contain full information on the species tree topology and suggest future directions of study for developing statistically consistent inference methods from clade frequencies on gene trees.

Abstract:
Identifiability of evolutionary tree models has been a recent topic of discussion and some models have been shown to be non-identifiable. A coalescent-based rooted population tree model, originally proposed by Nielsen et al. 1998 [2], has been used by many authors in the last few years and is a simple tool to accurately model the changes in allele frequencies in the tree. However, the identifiability of this model has never been proven. Here we prove this model to be identifiable by showing that the model parameters can be expressed as functions of the probability distributions of subsamples. This a step toward proving the consistency of the maximum likelihood estimator of the population tree based on this model.

Abstract:
Formulas are provided for the cumulants and the moments of the time $T$ back to the most recent common ancestor of the Kingman coalescent. It is shown that both the $j$th cumulant and the $j$th moment of $T$ are linear combinations of the values $\zeta(2m)$, $m\in\{0,\ldots,\lfloor j/2\rfloor\}$, of the Riemann zeta function $\zeta$ with integer coefficients. The proof is based on a solution of a two-dimensional recursion with countably many initial values. A closely related strong convergence result for the tree length $L_n$ of the Kingman coalescent restricted to a sample of size $n$ is derived. The results give reason to revisit the moments and central moments of the classical Gumbel distribution.