|
BMC Bioinformatics 2008
Universal seeds for cDNA-to-genome comparison

Abstract: Using statistical regression methods, we investigate the sensitivity of seeds, in particular good seeds, between four cDNA-to-genome comparisons at different evolutionary distances (human-dog, human-mouse, human-chicken and human-zebrafish), and identify classes of comparisons that show similar seed behavior and therefore can employ the same seed. In addition, we find that with high confidence good seeds for more distant comparisons perform well on closer comparisons, within 98–99% of the optimal seeds, and thus represent universal good seeds.

We show for the first time that optimal and near-optimal seeds for distant species-to-species comparisons are more generally applicable to a wide range of comparisons. This finding will be instrumental in developing practical and user-friendly cDNA-to-genome alignment applications, to aid in the annotation of new model organisms.

The next few years are expected to bring a significant increase in the number of available genomes, driven by advances in sequencing technologies [1]. As genome sequencing projects outpace the generation of native mRNA and protein sequences, gene annotation projects for these genomes will need to rely instead on cDNA information from other species. While existing alignment programs align cDNA and the corresponding genomic sequences accurately, they are inadequate for cross-species comparisons [2].

Beginning with blast [3,4], most alignment programs have used a seed-and-extend technique to produce local alignments, starting from exact or near-exact word matches (seeds) between the two sequences and extending them to a local alignment in several stages. Blast uses an exact match of 11 contiguous positions, represented by a vector of 1s (11c = 11111111111). Such a seed is called continuous. More recently, spaced seeds have been introduced, which allow wildcard positions in the seed pattern, marked with 0s.
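The difference between continuous and spaced seeds can be illustrated with a minimal Python sketch. It is not the implementation used by blast or by any program cited here. The function names `masked_word` and `seed_hits`, the toy sequences, and the short seeds `111` and `101` are assumptions chosen for illustration. A seed is written as a string of 1s (positions that must match) and 0s (wildcard positions), and a hit is any pair of positions at which the two sequences agree on every 1 position.

```python
from collections import defaultdict

# The continuous seed 11c described in the text: 11 contiguous match positions.
CONTIGUOUS_11 = "1" * 11


def masked_word(seq, start, seed):
    """Return the letters of seq that fall under the 1 positions of the seed,
    with the seed placed at offset start. Letters under 0 positions are ignored."""
    return "".join(seq[start + i] for i, bit in enumerate(seed) if bit == "1")


def seed_hits(query, target, seed):
    """Return sorted (query_pos, target_pos) pairs at which the seed matches.

    Illustrative only: the target is indexed by masked word, then every
    query window is looked up in that index."""
    span = len(seed)
    index = defaultdict(list)
    for t in range(len(target) - span + 1):
        index[masked_word(target, t, seed)].append(t)

    hits = []
    for q in range(len(query) - span + 1):
        # .get avoids adding empty entries to the index for missing words
        for t in index.get(masked_word(query, q, seed), []):
            hits.append((q, t))
    return sorted(hits)


if __name__ == "__main__":
    query, target = "ACGTAC", "ACTTAC"  # toy sequences, one mismatch at position 2
    print(seed_hits(query, target, "111"))  # continuous seed of length 3
    print(seed_hits(query, target, "101"))  # spaced seed with one wildcard
```

On these toy sequences the continuous seed `111` reports only the hit at `(3, 3)`, while the spaced seed `101` also reports `(1, 1)`, because its wildcard position absorbs the single mismatch.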
For instance, Kent and Zahler [5] used a seed that allowed for mismatches at the wobble codon positio
|