%0 Journal Article %T Vertebrate gene finding from multiple-species alignments using a two-level strategy %A David Carter %A Richard Durbin %J Genome Biology %D 2006 %I BioMed Central %R 10.1186/gb-2006-7-s1-s6 %X We describe DOGFISH, a vertebrate gene finder consisting of a cleanly separated site classifier and structure predictor. The classifier scores potential splice sites and other features, using sequence alignments between multiple vertebrate species, while the structure predictor hypothesizes coding transcripts by combining these scores using a simple model of gene structure. This also identifies and assigns confidence scores to possible additional exons. Performance is assessed on the ENCODE regions. We predict transcripts and exons across the whole human genome, and identify over 10,000 high confidence new coding exons not in the Ensembl gene set.We present a practical multiple species gene prediction method. Accuracy improves as additional species, up to at least eight, are introduced. The novel predictions of the whole-genome scan should support efficient experimental verification.Gene finding can usefully be viewed as a two-level task. At the lower or local level there is a classification task: one of assigning probability estimates to potential features such as splice sites and coding start and stop sites on the basis of sequence information associated with each potential feature. At the higher or global level, on the other hand, we have a structure-building task: finding the most probable way(s) to combine potential features into exons, transcripts and genes. Classification and structure building are very different tasks, and although a gene finder can be based on a single formalism, such as hidden Markov models (HMMs) [1,2], there is no reason to assume that the same technique will be optimal for both tasks. Although HMMs seem to offer a good basis for structure building, they impose independence assumptions that are not particularly well suited to feature classification; formalisms such as neural networks [3,4], maximum entropy modeling [5], Bayesian networks [6-8], support vector machines [9-11] and relevance vector machines (RVMs) [12-14] provide alternativ %U http://genomebiology.com/2006/7/S1/S6