|
BMC Bioinformatics 2005
Scoredist: A simple and robust protein sequence distance estimatorAbstract: We propose a correction-based protein sequence estimator called Scoredist. It uses a logarithmic correction of observed divergence based on the alignment score according to the BLOSUM62 score matrix. We evaluated Scoredist and a number of optimal matrix methods using three evolutionary models for both training and testing Dayhoff, Jones-Taylor-Thornton, and Müller-Vingron, as well as Whelan and Goldman solely for testing. Test alignments with known distances between 0.01 and 2 substitutions per position (1–200 PAM) were simulated using ROSE. Scoredist proved as accurate as the optimal matrix methods, yet substantially more robust. When trained on one model but tested on another one, Scoredist was nearly always more accurate. The Jukes-Cantor and Kimura correction methods were also tested, but were substantially less accurate.The Scoredist distance estimator is fast to implement and run, and combines robustness with accuracy. Scoredist has been incorporated into the Belvu alignment viewer, which is available at ftp://ftp.cgb.ki.se/pub/prog/belvu/ webcite.Estimating divergence time of protein sequences is one of the fundamental problems in bioinformatics. Evolutionary distance estimates are used by many of the most commonly used phylogenetic tree reconstruction algorithms [1-3]. In current research, phylogenetic trees are used for many types of subsequent analysis, e.g. orthology inference [4-6]. Early models for sequence evolution focussed on nucleotides. They commonly employ Markov chains and assume independent evolution at every site. Each of the four nucleotides is identified by one state and the substitution probability is modelled as a state transition probability from one state to another. In the most straightforward approach, the same state transition probability is assigned to every substitution [7]. Subsequent models take account of more nucleotide specific properties, e.g. transitional and transversional substitutions as well as GC content (see [8] for an i
|