Search Results: 1 - 10 of 100 matches for " "
All listed articles are free for downloading (OA Articles)
A Semi-Quantitative, Synteny-Based Method to Improve Functional Predictions for Hypothetical and Poorly Annotated Bacterial and Archaeal Genes
Alexis P. Yelton, Brian C. Thomas, Sheri L. Simmons, Paul Wilmes, Adam Zemla, Michael P. Thelen, Nicholas Justice, Jillian F. Banfield
PLOS Computational Biology, 2011, DOI: 10.1371/journal.pcbi.1002230
Abstract: During microbial evolution, genome rearrangement increases with increasing sequence divergence. If the relationship between synteny and sequence divergence can be modeled, gene clusters in genomes of distantly related organisms exhibiting anomalous synteny can be identified and used to infer functional conservation. We applied the phylogenetic pairwise comparison method to establish and model a strong correlation between synteny and sequence divergence in all 634 available Archaeal and Bacterial genomes from the NCBI database and four newly assembled genomes of uncultivated Archaea from an acid mine drainage (AMD) community. In parallel, we established and modeled the trend between synteny and functional relatedness in the 118 genomes available in the STRING database. By combining these models, we developed a gene functional annotation method that weights evolutionary distance to estimate the probability of functional associations of syntenous proteins between genome pairs. The method was applied to the hypothetical proteins and poorly annotated genes in newly assembled acid mine drainage Archaeal genomes to add or improve gene annotations. This is the first method to assign possible functions to poorly annotated genes through quantification of the probability of gene functional relationships based on synteny at a significant evolutionary distance, and has the potential for broad application.
Improving pan-genome annotation using whole genome multiple alignment
Samuel V Angiuoli, Julie C Dunning Hotopp, Steven L Salzberg, Hervé Tettelin
BMC Bioinformatics, 2011, DOI: 10.1186/1471-2105-12-272
Abstract: We introduce a new tool, Mugsy-Annotator, that identifies orthologs and evaluates annotation quality in prokaryotic genomes using whole genome multiple alignment. Mugsy-Annotator identifies anomalies in annotated gene structures, including inconsistently located translation initiation sites and disrupted genes due to draft genome sequencing or pseudogenes. An evaluation of species pan-genomes using the tool indicates that such anomalies are common, especially at translation initiation sites. Mugsy-Annotator reports alternate annotations that improve consistency and are candidates for further review. Whole genome multiple alignment can be used to efficiently identify orthologs and annotation problem areas in a bacterial pan-genome. Comparisons of annotated gene structures within a species may show more variation than is actually present in the genome, indicating errors in genome annotation. Our new tool Mugsy-Annotator assists re-annotation efforts by highlighting edits that improve annotation consistency. Advances in genome sequencing technologies have enabled sequencing of thousands of microbial genomes [1]. Often a single reference genome is insufficient to describe the genetic diversity of a species, leading to sequencing of many closely related isolates and subsequent comparative analysis. To aid in the analysis, an annotation process is typically performed using computational methods that include prediction of genes and their functions. Gene prediction algorithms for prokaryotes have been shown to perform well with relatively low error rates [2-4]. Limitations of gene prediction include accurate identification of the translation initiation start (TIS) sites and pseudogenes, and over-annotation in GC-rich genomes [5]. Specialized tools have addressed these issues, such as for improved TIS prediction [6]. 
In addition, post-processing can be used to identify annotation anomalies, as in GenePrimp [7]. While there are several tools for gene prediction of single genomes
Missing genes in the annotation of prokaryotic genomes
Andrew S Warren, Jeremy Archuleta, Wu-chun Feng, João Setubal
BMC Bioinformatics, 2010, DOI: 10.1186/1471-2105-11-131
Abstract: We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully-sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs. Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene. Genome annotation is a crucial step for the extraction of useful information from genomes. Yet, despite more than a decade of intensive efforts directed at improving annotation tools and at obtaining new experimental results, available annotations still suffer from a number of serious problems. 
The main problems regarding protein-coding genes, found in every single genome, include [1-3]: the presence of numerous bona-fide genes without any functional assignment (the so-called "hypothetical genes"); the presence of genes that are mis-annotated or with annotations that are too general to be of any use; and the possible existence of real genes that have gone un
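The ORF comparison described in the Warren et al. abstract above starts from all ORFs of at least 33 aa. The following is only a minimal sketch of such a scan, not the authors' pipeline. It assumes that an ORF is any stop-to-stop stretch in one of the six reading frames and that the standard codon assignments apply; the names `translate` and `six_frame_orfs` are hypothetical.

```python
from itertools import product

# Standard codon assignments in TCAG order (assumption: no alternative code needed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(dna, min_aa=33):
    """Return (strand, frame, peptide) for stop-to-stop stretches >= min_aa."""
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, seq in strands.items():
        for frame in range(3):
            # Split the translated frame at stop codons and keep long pieces.
            for peptide in translate(seq[frame:]).split("*"):
                if len(peptide) >= min_aa:
                    orfs.append((strand, frame, peptide))
    return orfs
```

A call such as `six_frame_orfs(genome_sequence, min_aa=33)` would return candidate peptides that could then be compared across genomes; requiring a start codon, as many gene finders do, would be a stricter ORF definition than the one assumed here.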
Experimental validation of novel genes predicted in the un-annotated regions of the Arabidopsis genome
William A Moskal, Hank C Wu, Beverly A Underwood, Wei Wang, Christopher D Town, Yongli Xiao
BMC Genomics, 2007, DOI: 10.1186/1471-2164-8-18
Abstract: 1,071 un-annotated loci were targeted by RACE, and full length sequence coverage was obtained for 35% of the targeted genes. We have verified the structure and expression of 378 genes that were not present within the most recent release of the Arabidopsis genome annotation. These 378 genes represent a structurally diverse set of transcripts and encode a functionally diverse set of proteins. We have investigated the accuracy of the Twinscan and EuGene gene prediction programs and found them to be reliable predictors of gene structure in Arabidopsis. Several hundred previously un-annotated genes were validated by this work. Based upon this information derived from these efforts it is likely that the Arabidopsis genome annotation continues to overlook several hundred protein coding genes. A complete annotated genome sequence of Arabidopsis thaliana was released by the Arabidopsis Genome Initiative (AGI) in the year 2000, the first completed plant genome [1]. Since then, our understanding of the Arabidopsis genome structure and transcriptome has been improved through the release of 4 sequential updates to the annotation, culminating in The Institute for Genomic Research's release 5 (TIGR5), which forms the basis of the work presented here. Following the TIGR5 annotation release, responsibility for maintaining and updating the Arabidopsis annotation was turned over to The Arabidopsis Information Resource (TAIR), which has since released version 6 of the Arabidopsis annotation (TAIR6). Over the course of the TIGR annotation releases, the number of annotated protein-coding genes of Arabidopsis has increased from 25,498 (a number that included transposons and pseudogenes) to a final total of 26,207 protein coding genes plus 3,786 regions annotated as transposon-related or other pseudogenes in the final TIGR release. 
At the same time, the size of the Arabidopsis pseudomolecules has increased from 115 MB in the initial 2000 release, to 119 MB in TIGR5 due to the inclusion of add
Proteomic Detection of Non-Annotated Protein-Coding Genes in Pseudomonas fluorescens Pf0-1
Wook Kim, Mark W. Silby, Sam O. Purvine, Julie S. Nicoll, Kim K. Hixson, Matt Monroe, Carrie D. Nicora, Mary S. Lipton, Stuart B. Levy
PLOS ONE, 2012, DOI: 10.1371/journal.pone.0008455
Abstract: Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of possible functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations that predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes that were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologs in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.
Annotation and evolutionary relationships of a small regulatory RNA gene micF and its target ompF in Yersinia species
Nicholas Delihas
BMC Microbiology, 2003, DOI: 10.1186/1471-2180-3-13
Abstract: Alignment and search methods using NCBI BLAST programs have been used to identify micF, ompF and ompC in Yersinia pestis and Yersinia enterocolitica. By alignment with DNA sequences from other bacterial species, 5' start sites of genes and upstream transcriptional regulatory sites in promoter regions were predicted. Annotated genes from Yersinia species provide phylogenetic information on the micF regulatory system. High sequence conservation in binding sites of transcriptional regulatory factors is found in the promoter region upstream of micF, and conservation in blocks of sequences as well as marked sequence variation is seen in segments of the micF RNA gene. Unexpectedly large differences in rates of evolution were found between the interacting RNA transcripts, micF RNA and the 5' UTR of the ompF mRNA. micF RNA/ompF mRNA 5' UTR duplex structures were modeled by the mfold program. Functional domains such as RNA/RNA interacting sites appear to display a minimum of evolutionary drift in sequence with the exception of a significant change in Y. enterocolitica micF RNA. Newly annotated Yersinia micF and ompF genes and the resultant RNA/RNA duplex structures add strong phylogenetic support for a generalized duplex model. The alignment and search approach using 5' UTR signatures may be a model to help define other genes and their start sites when annotated genes are available in well-defined reference organisms. The rapid determination of microbial genomic sequences poses a challenge in gene annotation and assignment of transcriptional start sites. Without experimental data, incorrect annotations can be made as well as erroneous determination of gene start sites. This is especially true for genes that are evolutionarily and structurally related such as the bacterial porin genes, ompF and ompC. For example, a BLAST search using the Salmonella typhimurium ompF gene sequence identifies Enterobacter cloacae ompC as well as Salmonella minnesota ompC (unpublished). 
However, when
Proteome driven re-evaluation and functional annotation of the Streptococcus pyogenes SF370 genome
Akira Okamoto, Keiko Yamada
BMC Microbiology, 2011, DOI: 10.1186/1471-2180-11-249
Abstract: Nine proteins encoded by novel ORFs were found by shotgun proteomic analysis, and their specific mRNAs were verified by reverse transcriptional PCR (RT-PCR). We also provided functional annotations for hypothetical genes using proteomic analysis from three different culture conditions that were separated into three fractions: supernatant, soluble, and insoluble. Consequently, we identified 567 proteins on re-evaluation of the proteomic data using an in-house database comprising 1,697 annotated and nine non-annotated CDSs. We provided functional annotations for 126 hypothetical proteins (18.9% out of the 668 hypothetical proteins) based on their cellular fractions and expression profiles under different culture conditions. The list of amino acid sequences that were annotated by genome analysis contains outdated information and unrecognized protein-coding sequences. We suggest that the six-frame database derived from actual DNA sequences be used for reliable proteomic analysis. In addition, the experimental evidence from functional proteomic analysis is useful for the re-evaluation of previously sequenced genomes. Comprehensive molecular biological approaches, including genome, transcriptome, proteome, and metabolome analyses are powerful, essential tools for understanding the phenotype of all living organisms. In recent years, high-throughput DNA sequencing technologies have enabled the sequencing of a microbial genome in a few days. However, the identification, annotation, and curation of genes have been limiting factors in the analysis of new genomes. The criteria for identifying and annotating genes depend on the curator. Usually, curators should annotate all open reading frames (ORFs) based on the features of promoter regions, such as the presence or absence of Shine-Dalgarno sequences, and based on homology searches with nucleic acid databases. 
Moreover, databases such as NCBInr in the National Center of Biotechnology Information (NCBI) have been updated, although
Automatic annotation of eukaryotic genes, pseudogenes and promoters
Victor Solovyev, Peter Kosarev, Igor Seledsov, Denis Vorobyev
Genome Biology, 2006, DOI: 10.1186/gb-2006-7-s1-s10
Abstract: The Fgenesh++ gene prediction pipeline can identify 91% of coding nucleotides with a specificity of 90%. Our automatic pseudogene finder (PSF program) found 90% of the manually annotated pseudogenes and some new ones. The Fprom promoter prediction program identifies 80% of TATA promoter sequences with one false positive prediction per 2,000 base-pairs (bp) and 50% of TATA-less promoters with one false positive prediction per 650 bp. It can be used to identify transcription start sites upstream of annotated coding parts of genes found by gene prediction software. We review our software and underlying methods for identifying these three important structural and functional genome components and discuss the accuracy of predictions, recent advances and open problems in annotating genomic sequences. We have demonstrated that our methods can be effectively used for initial automatic annotation of the eukaryotic genome. The successful completion of the Human Genome Project has demonstrated that large-scale sequencing projects can generate high-quality data at a reasonable cost. In addition to the human genome, researchers have already sequenced the genomes of a number of important model organisms that are commonly used as test beds in studying human biology. These are chimpanzee, mouse, rat, two puffer fish, two fruit flies, two sea squirts, two roundworms, and baker's yeast. Currently, sequencing centers are close to completing working drafts of the genomes of chicken, dog, honey bee, sea urchin and a set of four fungi, and a variety of other genomes are currently in the sequencing pipelines [1]. Many new genomes lack such rich experimental information as the human genome and, therefore, their initial computational annotation is even more important as a starting point for further research to uncover their biology. 
The more comprehensive and accurate are such computational analyses, the less time-consuming and costly experimental work will have to be done to determine all functi
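The Solovyev et al. abstract above reports nucleotide-level accuracy figures (91% of coding nucleotides found, 90% specificity). The sketch below shows one way such figures could be computed. It assumes the convention often used in gene-finder evaluations, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP); the abstract does not state which definition the authors used. The interval format (half-open start/end pairs) and the function names are hypothetical.

```python
def coding_positions(intervals):
    """Expand half-open (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level.

    Assumed convention: sensitivity = TP / annotated coding nucleotides,
    specificity = TP / predicted coding nucleotides.
    """
    truth = coding_positions(annotated)
    guess = coding_positions(predicted)
    true_positives = len(truth & guess)
    sensitivity = true_positives / len(truth) if truth else 0.0
    specificity = true_positives / len(guess) if guess else 0.0
    return sensitivity, specificity
```

For an annotated exon at `(0, 100)` and a predicted exon at `(10, 110)`, 90 of the 100 annotated positions are recovered and 90 of the 100 predicted positions are correct, so both values are 0.9.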
Genome re-annotation: a wiki solution?
Steven L Salzberg
Genome Biology, 2007, DOI: 10.1186/gb-2007-8-1-102
Abstract: So you think that gene you just retrieved from GenBank [1] is correct? Are you certain? If it is a eukaryotic gene, and especially if it is from an unfinished genome, there is a pretty good chance that the amino acid sequence is wrong. And depending on when the genome was sequenced and annotated, there is a chance that the description of its function is wrong too. Large-scale genome sequencing has revolutionized biology over the past ten years, generating vast amounts of new information that has radically transformed our understanding of hundreds of species, including ourselves. Sequencing centers continue to churn out new DNA sequences for a fantastic variety of species, covering more and more of the tree of life. Along with these sequences, the centers also produce genome annotation, which includes the locations and descriptions of all identifiable genes. These gene lists are the first pictures we get of what's inside a newly sequenced genome, and they can reveal key insights into what makes an organism distinctive. Sometimes the gene lists themselves are part of the story; for example, when the human genome was published [2,3], the headline was that humans have 'only' 25,000 genes, in contrast to earlier estimates of 100,000 or more. For many microbial species, the genome helps us to understand how the organism can accomplish something particularly difficult, such as how Deinococcus radiodurans (to cite just one of many examples) can withstand exposure to radiation levels far in excess of what a human could tolerate [4]. With each new human pathogen, the gene list helps us determine how the organism infects humans, how it causes sickness and (sometimes) how it becomes resistant to antibiotics. For these and other reasons, the accuracy of the gene list is tremendously important. Before addressing the problems with annotation, I will first summarize how it is done. 
The process of sequencing and annotating the DNA of a bacterial species has become highly automated in
Improved annotation through genome-scale metabolic modeling of Aspergillus oryzae
Wanwipa Vongsangnak, Peter Olsen, Kim Hansen, Steen Krogsgaard, Jens Nielsen
BMC Genomics, 2008, DOI: 10.1186/1471-2164-9-245
Abstract: Using our assembled EST sequences, we identified 1,046 newly predicted genes in the A. oryzae genome. Furthermore, it was possible to assign putative protein functions to 398 of the newly predicted genes. Notably, our annotation strategy resulted in assignment of new putative functions to 1,469 hypothetical proteins already present in the A. oryzae genome database. Using the substantially improved annotated genome we reconstructed the metabolic network of A. oryzae. This network contains 729 enzymes, 1,314 enzyme-encoding genes, 1,073 metabolites and 1,846 (1,053 unique) biochemical reactions. The metabolic reactions are compartmentalized into the cytosol, the mitochondria, the peroxisome and the extracellular space. Transport steps between the compartments and the extracellular space represent 281 reactions, of which 161 are unique. The metabolic model was validated and shown to correctly describe the phenotypic behavior of A. oryzae grown on different carbon sources. A much enhanced annotation of the A. oryzae genome was performed and a genome-scale metabolic model of A. oryzae was reconstructed. The model accurately predicted the growth and biomass yield on different carbon sources. The model serves as an important resource for gaining further insight into our understanding of A. oryzae physiology. A. oryzae is a member of the diverse group of aspergilli that includes species that are important microbial cell factories, as well as species that are human and plant pathogens [1]. A. oryzae has been used safely in the fermentation industry for hundreds of years in the production of soy sauce, miso and sake. Today A. oryzae is also used for production of a wide range of different fungal enzymes such as α-amylase, glucoamylase, lipase and protease and it is regarded as an ideal host for the synthesis of proteins of eukaryotic origin [1]. 
In the post genome-sequencing era, various high-throughput technologies have been developed to characterize biological systems on the geno
Copyright © 2008-2017 Open Access Library. All rights reserved.