|
BMC Research Notes 2012
Efficient computation of spaced seedsKeywords: Similarity search, Local alignment, Spaced seed, Heuristic algorithm, Sensitivity Abstract: SpEED uses a hill climbing method based on the overlap complexity heuristic. We propose a new algorithm for this heuristic that improves its speed by over one order of magnitude. We use the new implementation to compute improved seeds for several software programs. We compute as well multiple seeds of the same weight as MegaBLAST, that greatly improve its sensitivity.Multiple spaced seeds are being successfully used in bioinformatics software programs. Enabling researchers to compute very fast high quality seeds will help expanding the range of their applications.The most frequently used tools in bioinformatics are those searching for similarities, or local alignments, between biological sequences. This problem can be solved exactly using the dynamic programming algorithm of Smith-Waterman in quadratic time. Many instances, including all database searches, are too large for this approach to be feasible and heuristic algorithms are used instead [1,2]. The most widely used program in bioinformatics, BLAST [2,3], is one such tool. It uses the so-called "hit and extend" approach: a hit consists of 11 consecutive matches between two sequences and represents a potential local alignment. The hit is then extended both ways in search for similarity.It is clear that not all local alignments have to include an identical stretch of length 11. It has been already noticed in [4] and then again in [5] that requiring that the matches are not consecutive increases the chances of finding alignments. The idea of optimizing the way the required matches are placed has been investigated in [6,7], the latter having used it in a similarity search software, PatternHunter. Much work has been dedicated to spaced seeds. For a survey of earlier work, we refer the reader to [8].The 11 consecutive matches of BLAST are called a contiguous seed, denoted 11111111111 (for 11 consecutive matches), whereas the one of PatternHunter is a spaced seed, 111*1**1*1**11*111; a 1 represents a match and * a don
|