All Title Author
Keywords Abstract

Probabilistic Phylogenetic Inference with Insertions and Deletions

DOI: 10.1371/journal.pcbi.1000172

Full-Text   Cite this paper   Add to My Lib


A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth–death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new “concordance test” benchmark on real ribosomal RNA alignments, we show that the extended program dnamlε improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.


[1]  Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17: 368–376.
[2]  Mau B (1996) Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Ph.D. Dissertation, University of Wisconsin, Madison.
[3]  Rannala B, Yang Z (1996) Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J Mol Evol 43: 304–311.
[4]  Mau B, Newton MA, Larget B (1999) Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55: 1–12.
[5]  Larget B, Simon D (1999) Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol Biol Evol 16: 750–759.
[6]  Durbin R, Eddy SR, Krogh A, Mitchison GJ (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press.
[7]  Jukes TH, Cantor C (1965) Evolution of protein molecules. Mammalian Protein Metabolism. New York: Academic Press. pp. 21–132.
[8]  Dayhoff M, Schwartz R, Orcutt B (1978) A model of evolutionary change in protein. Atlas of Protein Sequence Structure 5: 345–352.
[9]  Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16: 111–120.
[10]  Hasegawa M, Kishino H, Yano T (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 21: 160–174.
[11]  Tavaré S (1986) Some probabilistic and statistical problems in the analysis of DNA sequences. Lect Math Life Sci 17: 57–86.
[12]  Churchill GA (1989) Stochastic models for heterogeneous DNA sequences. Bull Math Biol 51: 79–94.
[13]  Yang Z (1994) Estimating the pattern of nucleotide substitution. J Mol Evol 39: 105–111.
[14]  Felsenstein J, Churchill GA (1996) A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol 13: 93–104.
[15]  Goldman N, Thorne JL, Jones DT (1996) Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. J Mol Biol 263: 196–208.
[16]  Yang Z, Nielsen R, Hasegawa M (1998) Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol 15: 1600–1611.
[17]  Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum likelihood approach. Mol Biol Evol 18: 691–699.
[18]  Kosiol C, Goldman N, Buttimore NH (2004) A new criterion and method for amino acid classification. J Theor Biol 228: 97–106.
[19]  Muse SV (1996) Estimating synonymous and nonsynonymous substitution rates. Mol Biol Evol 13: 105–114.
[20]  Yang Z, Nielsen R, Goldman N, Pedersen A (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155: 431–449.
[21]  Knudsen B, Hein J (1999) RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15: 446–454.
[22]  Smith AD, Lui TW, Tillier ER (2004) Empirical models for substitution in ribosomal RNA. Mol Biol Evol 21: 419–427.
[23]  Knudsen B, Andersen ES, Damgaard C, Kjems J, Gorodkin J (2004) Evolutionary rate variation and RNA secondary structure prediction. Comput Biol Chem 28: 219–226.
[24]  Felsenstein J (2006) PHYLIP (Phylogeny Inference Package), version 3.66. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.
[25]  Swofford DL (2003) PAUP*. Phylogenetic analysis using parsimony (*and other methods,. version 4. Sunderland, Massachusetts: Sinauer Associates.
[26]  Adachi J, Hasegawa M (1995) MOLPHY programs for molecular phylogenetics, version 2.3. Tokyo: Institute of Statistical Mathematics.
[27]  Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13: 555–556.
[28]  Liò P, Goldman N, Thorne JL, Jones DT (1998) PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14: 726–733.
[29]  Simon D, Larget B (2000) Bayesian analysis in molecular biology and evolution (BAMBE), version 2.03 beta. Department of Mathematics and Computer Science, Duquesne University.
[30]  Ronquist F, Huelsenbeck JP (2003) MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17: 754–7555.
[31]  Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52: 696–704.
[32]  Cai W, Pei J, Grishin NV (2004) Reconstruction of ancestral protein sequences and its applications. BMC Evol Biol 4: 33.
[33]  Stamatakis A, Ludwig T, Meier H (2005) RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21: 456–463.
[34]  Yang Z (1995) A space–time process model for the evolution of DNA sequences. Genetics 139: 993–1005.
[35]  Gribskov M, Veretnik S (1996) Identification of sequence pattern with profile analysis. Methods Enzymol 266: 198–212.
[36]  Coin L, Durbin R (2004) Improved techniques for the identification of pseudogenes. Bioinformatics Suppl 1: I94–I100.
[37]  McAuliffe JD, Pachter L, Jordan MI (2004) Multiple-sequence functional annotation and the generalized hidden Markov phylogeny. Bioinformatics 20: 1850–1860.
[38]  Siepel A, Haussler D (2004) Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol 11: 413–428.
[39]  Mitchison GJ, Durbin RM (1995) Tree-based maximal likelihood substitutions matrices and hidden Markov models. J Mol Evol 41: 1139–11351.
[40]  Mitchison GJ (1999) A probabilistic treatment of phylogeny and sequence alignment. J Mol Evol 49: 11–22.
[41]  McGuire G, Denham MC, Balding DJ (2001) Models of sequence evolution for DNA sequences containing gaps. Mol Biol Evol 18: 481–490.
[42]  Qian B, Goldstein RA (2003) Detecting distant homologs using phylogenetic tree-based HMMs. Proteins 52: 446–453.
[43]  Blanchette M, Green ED, Miller W, Haussler D (2004) Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res 14: 2412–2423.
[44]  Keightley PD, Johnson T (2004) MCALIGN: stochastic alignment of noncoding DNA sequences based on an evolutionary model of sequence evolution. Genome Res 14: 442–450.
[45]  Qian B, Goldstein RA (2004) Performance of an iterated T-HMM for homology detection. Bioinformatics 20: 2175–2180.
[46]  Rivas E (2005) Evolutionary models for insertions and deletions in a probabilistic modeling framework. BMC Bioinformatics 6: 63.
[47]  Chindelevitch L, Li Z, Blais E, Blanchette M (2006) On the inference of parsimonious indel evolutionary scenarios. J Bioinform Comput Biol 4: 721–744.
[48]  Wang J, Keightley PD, Johnson T (2006) MCALIGN2: faster, accurate global pairwise alignment of non-coding DNA sequences based on explicit models of indel evolution. BMC Bioinformatics 7: 292.
[49]  Kim J, Sinha S (2007) Indelign: a probabilistic framework for annotation of insertions and deletions in a multiple alignment. Bioinformatics 23: 289–297.
[50]  Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33: 114–124.
[51]  Bishop MJ, Thompson EA (1986) Maximum likelihood alignment of DNA sequences. J Mol Biol 190: 159–165.
[52]  Thorne JL, Kishino H, Felsenstein J (1992) Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34: 3–16.
[53]  Thorne JL, Churchill GA (1995) Estimation and reliability of molecular sequence alignments. Biometrics 51: 100–113.
[54]  Miklós I, Toroczkai Z (2001) An improved model for statistical aligment. In: Gascuel O, Moret BME, editors. WABI 2001. Berlin Heidelberg: Springer-Verlag. pp. 1–10.
[55]  Metzler D (2003) Statistical alignment based on fragment insertion and deletion models. Bioinformatics 19: 490–499.
[56]  Miklós I, Lunter GA, Holmes I (2004) A “Long Indel” model for evolutionary sequence alignment. Mol Biol Evol 21: 529–540.
[57]  Holmes I, Bruno W (2001) Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics 17: 803–820.
[58]  Knudsen B, Miyamoto MM (2003) Sequence alignments and pair hidden Markov models using evolutionary history. J Mol Biol 333: 453–460.
[59]  Holmes I (2003) Using guide trees to construct multiple-sequence evolutionary HMMs. Bioinformatics Suppl 1: 147–157.
[60]  Pedersen JS, Hein J (2003) Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics 19: 219–227.
[61]  Holmes I (2004) A probabilistic model for the evolution of RNA structure. BMC Bioinformatics 5: 166.
[62]  Fleissner R, Metzler D, von Haeseler A (2005) Simultaneous statistical multiple alignment and phylogeny reconstruction. Syst Biol 54: 548–561.
[63]  Hein J, Wiuf C, Knudsen B, Moller MB, Wibling G (2000) Statistical alignment: computational properties, homology testing and goodness-of-fit. J Mol Biol 302: 265–279.
[64]  Steel M, Hein J (2001) Applying the Thorne-Kishino-Felsenstein model to sequence evolution on a star-shaped tree. Appl Math Lett 14: 679–684.
[65]  Hein J (2001) An algorithm ofr statistical alignment of sequences related by a binary tree. Pac Symp Biocomput 6: 179–190.
[66]  Hein J, Jensen JL, Pedersen CN (2003) Recursions for statistical multiple alignment. Proc Natl Acad Sci U S A 100: 14960–14965.
[67]  Lunter G, Miklós I, Song YS, Hein J (2003) An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J Mol Biol 10: 869–889.
[68]  Lunter G, Miklós I, Drummond A, Jensen J, Hein J (2005) Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 6: 83.
[69]  Lunter G, Miklós I, Drummond A, Jensen J, Hein J (2003) Bayesian phylogenetic inference under a statistical insertion-deletion model. Proceedings of WABI'03. Lect Notes Bioinformatics 2812: 228–244.
[70]  Felsenstein J (2004) Inferring Phylogenies. Sunderland, Massachusetts: Sinauer.
[71]  Karlin S, McGregor J (1955) Representation of a class of stochastic processes. Proc Natl Acad Sci U S A 41: 387–391.
[72]  Moler C, Loan CV (2003) Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev 45: 3–49.
[73]  Yang Z (2006) Computational molecular evolution. London: Oxford University Press.
[74]  Boussau B, Gouy M (2006) Efficient likelihood computations with nonreversible models of evolution. Syst Biol 55: 756–768.
[75]  Kuhner MK, Felsenstein J (1994) A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol 11: 459–468. Erratum in: Mol Biol Evol (1995) 12: 525.
[76]  Stoye J, Evers D, Meyer F (1998) Rose: generating sequence families. Bioinformatics 14: 157–163.
[77]  Robinson DF, Foulds LR (1981) Comparison of phylogenetic trees. Math Biosci 53: 131–147.
[78]  Pang A, Smith AD, Nuin PA, Tillier ER (2005) SIMPROT: using an empirically determined indel distribution in simulations of protein evolution. BMC Bioinformatics 6: 236.
[79]  Chang MS, Benner SA (2004) Empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments. J Mol Biol 341: 617–631.
[80]  Qian B, Goldstein RA (2001) Distribution of indel lengths. Proteins 45: 102–104.
[81]  Huelsenbeck JP (1995) The performance of phylogenetic methods in simulation. Syst Biol 44: 17–48.
[82]  Cannone JJ, Subramanian S, Schnare MN, Collett JR, D'Souza LM, et al. (2002) The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics 3: 2. Correction: BMC Bioinformatics 3: 15.
[83]  Hwang DG, Green P (2004) Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci U S A 101: 13994–14001.
[84]  Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15: 1034–1050.
[85]  Nielsen R, editor. (2005) Statistical Methods in Molecular Evolution. New York: Springer-Verlag.


comments powered by Disqus

Contact Us


微信:OALib Journal