Search Results: 1 - 10 of 100 matches for " "
All listed articles are free for downloading (OA Articles)
Multiple sequence alignment accuracy and evolutionary distance estimation
Michael S Rosenberg
BMC Bioinformatics, 2005, DOI: 10.1186/1471-2105-6-278
Abstract: The maximal gain in alignment accuracy was found not when the third sequence is directly intermediate between the initial two sequences, but rather when it perfectly subdivides the branch leading from the root of the tree to one of the original sequences (making it half as close to one sequence as the other). Evolutionary distance estimation in the multiple alignment framework, however, is largely unrelated to alignment accuracy and rather is dependent on the position of the third sequence; the closer the branch leading to the third sequence is to the root of the tree, the larger the estimated distance between the first two sequences. The bias in distance estimation appears to be a direct result of the standard greedy progressive algorithm used by many multiple alignment methods. These results have implications for choosing new taxa and genomes to sequence when resources are limited. DNA sequence alignment is a common step in molecular evolutionary analysis. Aligned sequences are used for many purposes, including estimation of patterns of divergence, selection, the tempo and mode of evolutionary change, identification of functional elements and constraints, and phylogenetic history, just to name a few. Alignments are a hypothesis of site homology; as evolutionary distance among sequences increases, alignments are known to become less accurate [1-7]. The effect of alignment accuracy on downstream analysis in comparative genomics and bioinformatics is largely an unexplored topic, although some empirical studies have attempted to examine this with respect to functional element identification [8,9] and phylogenetic analysis [10-16]. Multiple sequence alignment, the alignment of more than two sequences, is generally thought to lead to more accurate alignments than simple pairwise alignments [4].
There are numerous approaches to multiple alignment, although most are based in some way on a progressive alignment algorithm [17,18] where similar sequences are aligned first and
A comparative analysis of progressive multiple sequence alignment approaches using UPGMA and neighbor joining based guide trees
Ravi Kumar Yadav Dega, Gunes Ercal
Computer Science, 2015
Abstract: Multiple sequence alignment is increasingly important to bioinformatics, with applications ranging from phylogenetic analyses to domain identification. Of the several ways to perform multiple sequence alignment, an important one is the progressive alignment approach studied in this work. Progressive alignment involves three steps: find the distance between each pair of sequences; construct a guide tree based on the distance matrix; and finally, following the guide tree, align the sequences using the concept of aligned profiles. Our contribution is in comparing two main methods of guide tree construction, UPGMA and Neighbor Join, in terms of both efficiency and accuracy of the overall alignment. Our experimental results indicate that the Neighbor Join method is both more efficient in terms of performance and more accurate in terms of overall cost minimization.
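The first two steps named in this abstract (pairwise distances, then a guide tree) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy sequences, the use of p-distance on pre-aligned equal-length strings, and the helper names `p_distance` and `upgma` are all assumptions made for illustration, and the third step (profile alignment along the tree) is left out.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(names, dist):
    """Return a guide tree as nested tuples using average-linkage clustering.

    dist maps frozenset({leaf_i, leaf_j}) to the distance between two leaves.
    """
    # Each cluster is (subtree, list of leaf names under it).
    clusters = [(name, [name]) for name in names]

    def average(c1, c2):
        # Unweighted mean over all leaf pairs spanning the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1[1] for y in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Pick the closest pair of clusters (first pair wins on ties).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average(clusters[p[0]], clusters[p[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical toy input: four short, already equal-length sequences.
seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGTTCGA", "D": "TCGATCGA"}
dist = {
    frozenset((x, y)): p_distance(seqs[x], seqs[y])
    for x, y in combinations(seqs, 2)
}
tree = upgma(list(seqs), dist)
print(tree)
```

A progressive aligner would then walk this tree from the leaves upward, aligning the most similar sequences first. Neighbor joining would replace only the `upgma` step.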
Parallel progressive multiple sequence alignment on reconfigurable meshes
Nguyen Ken D, Pan Yi, Nong Ge
BMC Genomics, 2011, DOI: 10.1186/1471-2164-12-s5-s4
Abstract: Background: One of the most fundamental and challenging tasks in bioinformatics is to identify related sequences and their hidden biological significance. The most popular and proven method for this task is aligning multiple sequences together. However, multiple sequence alignment is a computationally intensive task. In addition, advances in DNA/RNA and protein sequencing techniques have created a vast number of sequences to be analyzed, exceeding the capability of traditional computing models. Therefore, an effective parallel multiple sequence alignment model capable of resolving these issues is in great demand. Results: We design O(1) run-time solutions for both local and global dynamic programming pairwise alignment algorithms on the reconfigurable mesh computing model. To align m sequences with maximum length n, we combine the parallel pairwise dynamic programming solutions with newly designed parallel components. We reduce the progressive multiple sequence alignment algorithm's run-time complexity from O(m × n^4) to O(m) using O(m × n^3) processing units for scoring schemes that use three distinct values for match/mismatch/gap-extension. The general solution to the multiple sequence alignment algorithm takes O(m × n^4) processing units and completes in O(m) time. Conclusions: To our knowledge, this is the first time the progressive multiple sequence alignment algorithm has been completely parallelized with O(m) run-time. We also provide a new parallel algorithm for the Longest Common Subsequence (LCS) with O(1) run-time using O(n^3) processing units. This is a big improvement over the current best constant-time algorithm, which uses O(n^4) processing units.
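For reference, the Longest Common Subsequence problem mentioned in this abstract has a standard sequential dynamic-programming solution. The Python sketch below shows that textbook version, which runs in time proportional to the product of the two lengths on a single processor. It is not the constant-time reconfigurable-mesh algorithm of the paper, and the function name `lcs_length` is chosen here for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (sequential DP)."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence ending before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one character from either string and keep the better.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_length("AGGTAB", "GXTXAYB"))  # LCS is "GTAB"
```

Each cell depends only on its left, upper, and upper-left neighbors, which is the dependency structure that parallel formulations exploit.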
Evolutionary distance estimation and fidelity of pair wise sequence alignment
Michael S Rosenberg
BMC Bioinformatics, 2005, DOI: 10.1186/1471-2105-6-102
Abstract: Under the studied conditions, distance estimation was relatively unaffected by alignment error (50% or more of the sites incorrectly aligned) as long as 50% or more of the sites were identical among the sequences (observed P-distance < 0.5). Beyond this threshold, the alignment procedure artificially inflates the apparent sequence identity, skewing distance estimates, and creating alignments that are essentially indistinguishable from random data. This general result was independent of substitution model, sequence length, and insertion and deletion size and rate. Examination of the estimated sequence identity may yield some guidance as to the accuracy of the alignment. Inaccurate alignments are expected to have large effects on analyses dependent on site specificity, but analyses that depend on evolutionary distance may be somewhat robust to alignment error as long as fewer than half of the sites have diverged. Evolutionary distance, the number of substitutions per site separating a pair of homologous sequences since they diverged from their common ancestral sequence, is an extremely important measure in molecular evolution and comparative genomics. It is used for a wide variety of purposes, ranging from phylogenetic analysis [1,2], to estimating times of divergence [3,4], the tempo and mode of evolutionary change [5], and functional constraints [6,7]. Evolutionary distance estimation is often one of the first steps in high-throughput sequence analysis; errors in these estimates may have wide-ranging consequences on downstream analyses and conclusions. There are many ways to estimate evolutionary distance; accuracy of various methods tends to be dependent on proper specification of the substitution model and sequence length [8,9]. One factor that has not been well examined with respect to evolutionary distance estimation, however, is alignment (although see [10-12]). Sequence alignment is an extremely common analytical tool used in comparative genomics.
The purpose of
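To make the two quantities in this abstract concrete, the following Python sketch computes the observed P-distance of an aligned pair and converts it to substitutions per site. The Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), is used here only as one familiar example of a substitution model and is an assumption of this sketch, since the abstract does not single out a model. Skipping gapped columns is likewise an assumed convention, and the function names are illustrative.

```python
import math


def p_distance(a, b, gap="-"):
    """Observed proportion of differing sites, ignoring columns with a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor estimate of substitutions per site from a P-distance."""
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGT-A", "ACTTGA")
print(p, jukes_cantor(p))
```

The corrected distance always exceeds the observed P-distance for p > 0, because it accounts for multiple substitutions at the same site.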
Reticular alignment: A progressive corner-cutting method for multiple sequence alignment
Adrienn Szabó, Ádám Novák, István Miklós, Jotun Hein
BMC Bioinformatics, 2010, DOI: 10.1186/1471-2105-11-570
Abstract: We implemented the program in the Java programming language, and tested it on the BAliBASE database. Reticular Alignment can outperform ClustalW even if a very simple scoring scheme (BLOSUM62 and affine gap penalty) is implemented and merely the threshold value is increased. However, this set-up is not sufficient for outperforming other cutting-edge alignment methods. On the other hand, the reticular alignment search strategy together with sophisticated scoring schemes (for example, differentiating gap penalties for hydrophobic and hydrophilic amino acids) outperforms FSA and, in some accuracy measures, even MAFFT. The program is available from http://phylogeny-cafe.elte.hu/RetAlign/. Reticular alignment is an efficient search strategy for finding accurate multiple alignments. The highest accuracy is achieved when this search strategy is combined with sophisticated scoring schemes. The multiple sequence alignment problem is still the Holy Grail of bioinformatics [1]. There are 517100 sequences in the UniProtKB/Swiss-Prot release of the 18th of May 2010 (http://expasy.org/sprot/), while on the other hand, there are only 65802 known structures in the last PDB database release of the 8th of June 2010 (http://www.pdb.org/pdb/home/home.do). Therefore, the in silico prediction of protein structures is still demanding, and the majority of the protein structure prediction methods need accurate alignments. There are two major technical hurdles in the multiple sequence alignment problem. The first is the scoring problem: how to score the alignments such that the best scored alignment is the most accurate one. The second is the algorithmic problem: how to find the best scored alignment. Significantly more effort has been put into the research for solving the second challenge.
Although the number of possible alignments of two sequences grows exponentially with the length of the sequences, finding the best scoring alignment of two sequences is computationally feasi
Sequence embedding for fast construction of guide trees for multiple sequence alignment
Gordon Blackshields, Fabian Sievers, Weifeng Shi, Andreas Wilm, Desmond G Higgins
Algorithms for Molecular Biology, 2010, DOI: 10.1186/1748-7188-5-21
Abstract: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pairwise distances. We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz. The majority of multiple sequence alignment (MSA) methods use some form of progressive alignment [1-7]. In progressive alignment the usual first step is to compute a pairwise distance matrix, which is then used to make a so-called guide tree in order to determine the order of alignment of the input sequences. The computation of the distance matrix requires N(N - 1)/2 pairwise comparisons, N being the number of sequences. Construction of the guide tree usually has an additional time complexity of O(N^2) to O(N^3), depending on the algorithm used and its implementation. The complexity of these steps can become prohibitive when N becomes very large, e.g. when N is in the tens of thousands. There are very few multiple alignment programs that can handle datasets of this size, with MUSCLE and MAFFT being the most familiar [6,7]. Some of the most accurate multiple sequence alignment methods can only routinely handle sequences numbering in the hundreds [4,8,9]. The explosive growth in the number of sequences coming from genomic studies means that the ability to cluster and align greater numbers of sequences is becoming even more important.
For example, the Ribosomal Database Project [10] Release 10 consists of more than a million sequences. In order to make very large guide trees, the fi
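The quadratic growth described in this abstract is easy to check numerically. The short Python sketch below evaluates N(N - 1)/2 for a few dataset sizes; the sizes and the function name `n_pairwise` are chosen here for illustration and do not come from the paper.

```python
from itertools import combinations


def n_pairwise(n):
    """Number of unordered sequence pairs in a full distance matrix."""
    return n * (n - 1) // 2


# Sanity check against an explicit enumeration for a small N.
assert n_pairwise(5) == len(list(combinations(range(5), 2)))

for n in (100, 10_000, 1_000_000):
    print(f"N = {n:>9,}: {n_pairwise(n):,} pairwise comparisons")
```

Going from ten thousand to a million sequences multiplies the number of comparisons by roughly ten thousand, which is the cost that embedding-based approaches aim to avoid.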
DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment
Amarendran R Subramanian, Michael Kaufmann, Burkhard Morgenstern
Algorithms for Molecular Biology, 2008, DOI: 10.1186/1748-7188-3-6
Abstract: Our new heuristic produces significantly better alignments, especially on globally related sequences, without excessively increasing CPU time and memory consumption. The new method is based on a guide tree; to detect possible spurious sequence similarities, it employs a vertex-cover approximation on a conflict graph. We performed benchmarking tests on a large set of nucleic acid and protein sequences. For protein benchmarks we used the benchmark database BALIBASE 3 and an updated release of the database IRMBASE 2 for assessing the quality on globally and locally related sequences, respectively. For alignment of nucleic acid sequences, we used BRAliBase II for global alignment and a newly developed database of locally related sequences called DIRMBASE 1. IRMBASE 2 and DIRMBASE 1 are constructed by implanting highly conserved motifs at random positions in long unalignable sequences. On BALIBASE 3, our new program performs significantly better than the previous program DIALIGN-T and outperforms the popular global aligner CLUSTAL W, though it is still outperformed by programs that focus on global alignment like MAFFT, MUSCLE and T-COFFEE. On the locally related test sets in IRMBASE 2 and DIRMBASE 1, our method outperforms all other programs, while MAFFT E-INSi is the only method that comes close to the performance of DIALIGN-TX. DIALIGN is a widely used software for multiple alignment of nucleic acid and protein sequences [1,2] that combines local and global alignment features. Pairwise or multiple alignments are composed by aligning local pairwise similarities. More precisely, pairwise local gap-free alignments called fragment alignments or fragments are used as building blocks to assemble multiple alignments. Each possible fragment is given a score that is related to the P values used by BLAST [3,4], and the program then tries to find a consistent set of fragments from all possible sequence pairs, maximizing the total score of these fragments.
Gaps are not penalized
Proper Distance Metrics for Phylogenetic Analysis Using Complete Genomes without Sequence Alignment
Zu-Guo Yu, Xiao-Wen Zhan, Guo-Sheng Han, Roger W. Wang, Vo Anh, Ka Hou Chu
International Journal of Molecular Sciences, 2010, DOI: 10.3390/ijms11031141
Abstract: A shortcoming of most correlation distance methods based on the composition vectors without alignment developed for phylogenetic analysis using complete genomes is that the “distances” are not proper distance metrics in the strict mathematical sense. In this paper we propose two new correlation-related distance metrics to replace the old one in our dynamical language approach. Four genome datasets are employed to evaluate the effects of this replacement from a biological point of view. We find that the two proper distance metrics yield trees with the same or similar topologies as/to those using the old “distance” and agree with the tree of life based on 16S rRNA in a majority of the basic branches. Hence the two proper correlation-related distance metrics proposed here improve our dynamical language approach for phylogenetic analysis.
The Effects of Alignment Quality, Distance Calculation Method, Sequence Filtering, and Region on the Analysis of 16S rRNA Gene-Based Studies
Patrick D. Schloss
PLOS Computational Biology, 2010, DOI: 10.1371/journal.pcbi.1000844
Abstract: Pyrosequencing of PCR-amplified fragments that target variable regions within the 16S rRNA gene has quickly become a powerful method for analyzing the membership and structure of microbial communities. This approach has revealed and introduced questions that were not fully appreciated by those carrying out traditional Sanger sequencing-based methods. These include the effects of alignment quality, the best method of calculating pairwise genetic distances for 16S rRNA genes, whether it is appropriate to filter variable regions, and how the choice of variable region relates to the genetic diversity observed in full-length sequences. I used a diverse collection of 13,501 high-quality full-length sequences to assess each of these questions. First, alignment quality had a significant impact on distance values and downstream analyses. Specifically, the greengenes alignment, which does a poor job of aligning variable regions, predicted higher genetic diversity, richness, and phylogenetic diversity than the SILVA and RDP-based alignments. Second, the effect of different gap treatments in determining pairwise genetic distances was strongly affected by the variation in sequence length for a region; however, the effect of different calculation methods was subtle when determining the sample's richness or phylogenetic diversity for a region. Third, applying a sequence mask to remove variable positions had a profound impact on genetic distances by muting the observed richness and phylogenetic diversity. Finally, the genetic distances calculated for each of the variable regions did a poor job of correlating with the full-length gene. Thus, while it is tempting to apply traditional cutoff levels derived for full-length sequences to these shorter sequences, it is not advisable. Analysis of β-diversity metrics showed that each of these factors can have a significant impact on the comparison of community membership and structure. 
Taken together, these results urge caution in the design and interpretation of analyses using pyrosequencing data.
An enhanced RNA alignment benchmark for sequence alignment programs
Andreas Wilm, Indra Mainz, Gerhard Steger
Algorithms for Molecular Biology, 2006, DOI: 10.1186/1748-7188-1-19
Abstract: The RNA sequence sets in the benchmark database are taken from an increased number of RNA families to avoid unintended impact by using only a few families. The size of sets varies from 2 to 15 sequences to assess the influence of the number of sequences on program performance. Alignment quality is scored by two measures: one takes into account only nucleotide matches, the other measures structural conservation. The performance order of parameters, like nucleotide substitution matrices and gap-costs, as well as of programs is rated by rank tests. Most sequence alignment programs perform equally well on RNA sequence sets with high sequence identity, that is, with an average pairwise sequence identity (APSI) above 75 %. Parameters for gap-open and gap-extension have a large influence on alignment quality at APSI ≤ 75 %; optimal parameter combinations are shown for several programs. The use of different 4 × 4 substitution matrices improved program performance only in some cases. The performance of iterative programs drastically increases with increasing sequence numbers and/or decreasing sequence identity, which makes them clearly superior to programs using a purely non-iterative, progressive approach. The best sequence alignment programs produce alignments of high quality down to APSI > 55 %; at lower APSI the use of sequence+structure alignment programs is recommended. Correctly aligning RNAs in terms of sequence and structure is a notoriously difficult problem. Unfortunately, the solution proposed by Sankoff [1] 20 years ago has a complexity of O(n^(3m)) in time, and O(n^(2m)) in space, for m sequences of length n. Thus, most structure alignment programs (e.g. DYNALIGN [2], FOLDALIGN [3], PMCOMP [4], or STEMLOC [5]) implement lightweight variants of Sankoff's algorithm, but are still computationally demanding.
Consequently, researchers often create an initial sequence alignment that is afterwards corrected manually or by the aid of RNA alignment editors (e. g. CONSTR
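The average pairwise sequence identity (APSI) used in this abstract to grade alignment difficulty can be sketched in a few lines of Python. Conventions for pairwise identity differ between tools; this sketch assumes that columns where both sequences have a gap are skipped and that every remaining column counts in the denominator, which may not match the benchmark's exact definition. The toy alignment and function names are illustrative.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical columns, skipping columns gapped in both rows."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]
    if not columns:
        raise ValueError("no comparable columns")
    matches = sum(x == y for x, y in columns)
    return matches / len(columns)


def apsi(alignment):
    """Mean pairwise identity over all unordered pairs of aligned sequences."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical three-sequence RNA alignment.
aln = ["ACGU", "ACGA", "AC-U"]
print(f"APSI = {100 * apsi(aln):.1f} %")
```

A value computed this way could then be compared against the 75 % and 55 % thresholds the abstract reports, keeping in mind that a different identity convention would shift the number.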