Algorithms for Molecular Biology 2010

Sequence embedding for fast construction of guide trees for multiple sequence alignment

DOI: 10.1186/1748-7188-5-21

Gordon Blackshields, Fabian Sievers, Weifeng Shi, Andreas Wilm, Desmond G Higgins

Abstract:

In this paper, we have tested variations on a class of embedding methods designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods embed the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances. We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences, and we demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.

The majority of multiple sequence alignment (MSA) methods use some form of progressive alignment [1-7]. In progressive alignment, the usual first step is to compute a pair-wise distance matrix, which is then used to make a so-called guide tree that determines the order of alignment of the input sequences. The computation of the distance matrix requires N(N - 1)/2 pair-wise comparisons, N being the number of sequences. Construction of the guide tree usually has an additional time complexity of O(N^2) to O(N^3), depending on the algorithm used and its implementation. The complexity of these steps can become prohibitive when N becomes very large, e.g. when N is in the tens of thousands. Very few multiple alignment programs can handle datasets of this size, with MUSCLE and MAFFT being the most familiar [6,7]. Some of the most accurate multiple sequence alignment methods can only routinely handle sequences numbering in the hundreds [4,8,9].

The explosive growth in the number of sequences coming from genomic studies means that the ability to cluster and align greater numbers of sequences is becoming even more important. For example, the Ribosomal Database Project [10] Release 10 consists of more than a million sequences. In order to make very large guide trees, the fi
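To make the cost argument concrete, the sketch below counts the N(N - 1)/2 comparisons needed for a full distance matrix and shows one simple way of embedding sequences as vectors of distances to a small set of reference sequences. It is an illustration only, not the authors' implementation: the function names, the toy k-mer distance, and the choice of reference sequences are all assumptions made for this example.

```python
def pairwise_count(n: int) -> int:
    """Number of distinct pair-wise comparisons among n sequences: N(N-1)/2."""
    return n * (n - 1) // 2


def toy_distance(a: str, b: str, k: int = 2) -> float:
    """Hypothetical stand-in distance: 1 minus the Jaccard similarity of k-mer sets.

    A real aligner would use a more expensive, more accurate measure.
    """
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    if not union:
        return 0.0  # both sequences shorter than k: treat as identical
    return 1.0 - len(kmers_a & kmers_b) / len(union)


def embed(seqs: list[str], refs: list[str]) -> list[list[float]]:
    """Represent each sequence by its distances to the reference sequences.

    This needs len(seqs) * len(refs) distance calculations instead of
    N(N-1)/2; how the references are chosen is left open here.
    """
    return [[toy_distance(s, r) for r in refs] for s in seqs]


if __name__ == "__main__":
    n = 10_000
    print("full matrix comparisons:", pairwise_count(n))
    seqs = ["ACGT", "ACGA", "TTTT"]
    refs = ["ACGT"]  # assumed reference set, for illustration only
    print(embed(seqs, refs))
```

With a fixed, small number of reference sequences, the number of distance calculations grows linearly in N rather than quadratically, which is the kind of saving the abstract describes.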
