%0 Journal Article
%T The effect of sequencing errors on metagenomic gene prediction
%A Katharina J Hoff
%J BMC Genomics
%D 2009
%I BioMed Central
%R 10.1186/1471-2164-10-520
%X In this study, Sanger and pyrosequencing reads were simulated on the basis of models that take all types of sequencing errors into account. All metagenomic gene prediction tools showed decreasing accuracy with increasing sequencing error rates. Performance results on an established metagenomic benchmark dataset are also reported. In addition, we demonstrate that ESTScan, a tool for sequencing error compensation in eukaryotic expressed sequence tags, outperforms some metagenomic gene prediction tools on reads with high error rates although it was not designed for the task at hand. This study fills an important gap in metagenomic gene prediction research. Specialized methods are evaluated and compared with respect to sequencing error robustness. Results indicate that the integration of error-compensating methods into metagenomic gene prediction tools would be beneficial to improve metagenome annotation quality. Metagenomes are analyzed through simultaneous sequencing of all species in a microbial community without prior cultivation under laboratory conditions. The result is usually a large collection of sequencing reads from many species, and the phylogenetic origin of each read is unknown. A major goal in all metagenomic studies is the identification of potential protein functions and metabolic pathways. Reliable gene predictions are the basis for correct functional annotation, and for the discovery of new genes with their functions. Several gene prediction methods have been developed for the ab initio identification of protein coding genes in complete microbial genomes (e.g. GLIMMER and GeneMark [1,2]). These methods require an initial training phase on some data from the target genome, or training on the genome of a closely related species. Such conventional gene finders can in principle be applied to metagenomic data, given that single sequencing reads can be assembled into longer contigs in order to provide sufficient training data.
The applicability of conventional
%U http://www.biomedcentral.com/1471-2164/10/520