Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources

DOI: 10.1186/1471-2105-7-62

Abstract:

We present a fairly general method for integration of external information. Our method is based on the evaluation of hints to potentially protein-coding regions by means of a Generalized Hidden Markov Model (GHMM) that takes both intrinsic and extrinsic information into account. We used this method to extend the ab initio gene prediction program AUGUSTUS to a versatile tool that we call AUGUSTUS+. In this study, we focus on hints derived from matches to an EST or protein database, but our approach can be used to include arbitrary user-defined hints. Our method is only moderately affected by the length of a database match. Further, it exploits the information that can be derived from the absence of such matches. As a special case, AUGUSTUS+ can predict genes under user-defined constraints, e.g. if the positions of certain exons are known. With hints from EST and protein databases, our new approach was able to predict 89% of the exons in human chromosome 22 correctly.

Sensitive probabilistic modeling of extrinsic evidence such as sequence database matches can increase gene prediction accuracy. When a match of a sequence interval to an EST or protein sequence is used, it should be treated as compound information rather than as information about individual positions.

Finding protein-coding genes in eukaryotic genomic sequences with in-silico methods remains an important challenge in computational genomics, despite many years of intensive research. Existing methods fall into two groups with respect to the data they use. The first group consists of ab initio programs, which use only the query genomic sequence as input. Examples are the HMM-based programs GENSCAN [1], AUGUSTUS [2] and HMMGene [3], as well as GENEID [4]. The second group, extrinsic methods, comprises all programs that use data other than the query genomic sequence. Some extrinsic methods use genomic sequences from other species. A cross-species comparison of genomic sequences