Search Results: 1 - 10 of 100 matches for " "
All listed articles are free for downloading (OA Articles)
Page 1 /100
Display every page Item
Splign: algorithms for computing spliced alignments with identification of paralogs
Yuri Kapustin, Alexander Souvorov, Tatiana Tatusova, David Lipman
Biology Direct , 2008, DOI: 10.1186/1745-6150-3-20
Abstract: We describe a set of algorithms behind a tool called Splign for computing cDNA-to-Genome alignments. The algorithms include a high-performance preliminary alignment, a compartment identification based on a formally defined model of adjacent duplicated regions, and a refined sequence alignment. In a series of tests, Splign has produced more accurate results than other tools commonly used to compute spliced alignments, in a reasonable amount of time. Splign's ability to deal with various issues complicating the spliced alignment problem makes it a helpful tool in eukaryotic genome annotation processes and alternative splicing studies. Its performance is sufficient to align the largest currently available pools of cDNA data, such as the human EST set, on a moderate-sized computing cluster in a matter of hours. The duplication identification (compartmentization) algorithm can be used independently in other areas such as the study of pseudogenes. This article was reviewed by: Steven Salzberg, Arcady Mushegian and Andrey Mironov (nominated by Mikhail Gelfand). Spliced gene products available in the form of cDNA sequences provide an experimental level of support for gene models. It has been shown [1] that the availability of large numbers of such sequences greatly improves the quality of identification of gene structures, especially in UTR regions which are beyond the application scope of most ab initio gene-prediction methods. Accuracy of spliced alignments is crucial in such areas as studies of alternative splicing and regulatory elements. Over the last decade, significant attention has been given to the development of tools that address the spliced alignment problem. A useful overview of such tools has been given in [2]. Despite considerable progress in more recent tools, various types of alignment errors are still observed. Such errors include missing micro-exons, forced consensus splice signals and alignments stretching over several members of tandem gene clusters.
Another critical i
Functional and evolutionary analysis of alternatively spliced genes is consistent with an early eukaryotic origin of alternative splicing
Manuel Irimia, Jakob Rukov, David Penny, Scott Roy
BMC Evolutionary Biology , 2007, DOI: 10.1186/1471-2148-7-188
Abstract: For each species, we find that genes from most functional categories are alternatively spliced. Ancient genes (shared between animals, fungi and plants) show high levels of alternative splicing. Genes with products expressed in the nucleus or plasma membrane are generally more alternatively spliced while those expressed in extracellular locations show less alternative splicing. We find a clear correspondence between incidence of alternative splicing and intron number per gene both within and between genomes. In general, we find several similarities in patterns of alternative splicing across these diverse eukaryotes. Along with previous studies indicating intron-rich genes with weak intron boundary consensus and complex spliceosomes in ancestral organisms, our results suggest that at least a simple form of alternative splicing may already have been present in the unicellular ancestor of plants, fungi and animals. A role for alternative splicing in the evolution of multicellularity then would largely have arisen by co-opting the preexisting process. Alternative splicing (AS) of transcripts is common in diverse eukaryotic lineages. By this mechanism, a variety of transcripts and proteins are produced from a single gene, contributing to increased transcriptome and proteome diversity. AS has been reported in a wide range of eukaryotic groups including plants, apicomplexans, diatoms, amoebae, animals and fungi [1-5]. However, it is unclear and hard to assess whether this process has arisen independently in the different lineages (as suggested by some authors, e.g. [6]) or whether it was already present in their last common ancestor. The spliceosome, the machinery responsible for the splicing of introns in eukaryotic genes, is ancestral to all extant eukaryotic groups with the last common ancestor possessing a complex machinery, similar to that found in most modern organisms [7].
In addition, we recently argued that eukaryotic ancestors had weak 5' splice site boundary consen
EuCAP, a Eukaryotic Community Annotation Package, and its application to the rice genome
Françoise Thibaud-Nissen, Matthew Campbell, John P Hamilton, Wei Zhu, C Buell
BMC Genomics , 2007, DOI: 10.1186/1471-2164-8-388
Abstract: We have developed the Eukaryotic Community Annotation Package (EuCAP), an annotation tool, and have applied it to the rice genome. The primary level of curation by community annotators (CA) has been the annotation of gene families. Annotation can be submitted by email or through the EuCAP Web Tool. The CA models are aligned to the rice pseudomolecules and the coordinates of these alignments, along with functional annotation, are stored in the MySQL EuCAP Gene Model database. Web pages displaying the alignments of the CA models to the Osa1 Genome models are automatically generated from the EuCAP Gene Model database. The alignments are reviewed by the project annotators (PAs) in the context of experimental evidence. Upon approval by the PAs, the CA models, along with the corresponding functional annotations, are integrated into the Osa1 Genome Annotation. The CA annotations, grouped by family, are displayed on the Community Annotation pages of the project website http://rice.tigr.org, as well as in the Community Annotation track of the Genome Browser. We have applied EuCAP to rice. As of July 2007, the structural and/or functional annotation of 1,094 genes representing 57 families has been deposited and integrated into the current gene set. All of the EuCAP components are open-source, thereby allowing the implementation of EuCAP for the annotation of other genomes. EuCAP is available at http://sourceforge.net/projects/eucap/. Accurate and consistent annotation of genomes presents a challenge that can be partially solved by automated and semi-automated annotation methods. Improvements in the structural annotation of gene models can be obtained through training of ab initio gene finders and, for eukaryotes, through empirical transcript support in the form of Expressed Sequence Tags (ESTs) and, more critically, full-length cDNAs [1-3].
In addition to structural annotation, in large-scale genome annotation projects functional annotation is performed in an a
Cross-species EST alignments reveal novel and conserved alternative splicing events in legumes
Bing-Bing Wang, Mike O'Toole, Volker Brendel, Nevin D Young
BMC Plant Biology , 2008, DOI: 10.1186/1471-2229-8-17
Abstract: Based on cognate EST alignments alone, the observed frequency of alternatively spliced genes is lower in Mt (~10%, 1,107 genes) and Lj (~3%, 92 genes) than in Arabidopsis and rice (both around 20%). However, AS frequencies are comparable in all four species if EST levels are normalized. Intron retention is the most common form of AS in all four plant species (~50%), with slightly lower frequency in legumes compared to Arabidopsis and rice. This differs notably from vertebrates, where exon skipping is most common. To uncover additional AS events, we aligned ESTs from other legume species against the Mt genome sequence. In this way, 248 additional Mt genes were predicted to be alternatively spliced. We also identified 22 AS events completely conserved in two or more plant species. This study extends the range of plant taxa shown to have high levels of AS, confirms the importance of intron retention in plants, and demonstrates the utility of using ESTs from related species in order to identify novel and conserved AS events. The results also indicate that the frequency of AS in plants is comparable to that observed in mammals. Finally, our results highlight the importance of normalizing EST levels when estimating the frequency of alternative splicing. Alternative splicing (AS) is an important cellular process that leads to multiple mRNA isoforms from a single pre-mRNA in eukaryotic organisms. Plant AS events used to be regarded as rare. However, a growing number of computational studies have now demonstrated that the frequency of alternatively spliced genes in plants is higher than previously estimated [1,2]. 20–30% of expressed genes are alternatively spliced in Arabidopsis thaliana (At) and rice (Oryza sativa, Os) as revealed by large scale EST-genome alignments [1,2].
A recent study using EST pairs gapped alignments (EST-EST) surveyed 11 plant species and suggested that overall AS frequencies vary greatly in different plant species, with some rates comparable to those
Using several pair-wise informant sequences for de novo prediction of alternatively spliced transcripts
Paul Flicek, Michael R Brent
Genome Biology , 2006, DOI: 10.1186/gb-2006-7-s1-s8
Abstract: MARS uses the mouse, rat, dog, opossum, chicken, and frog genome sequences as pairwise informant sources for Twinscan and combines the resulting transcript predictions into genes based on coding (CDS) region overlap. Based on the EGASP assessment, MARS is one of the more accurate dual-genome prediction programs. Compared to the GENCODE annotation, we find that predictive sensitivity increases, while specificity decreases, as more informant species are used. MARS correctly predicts alternatively spliced transcripts for 11 of the 236 multi-exon GENCODE genes that are alternatively spliced in the coding region of their transcripts. For these genes a total of 24 correct transcripts are predicted. The MARS algorithm is able to predict alternatively spliced transcripts without the use of expressed sequence information, although the number of loci in which multiple predicted transcripts match multiple alternatively spliced transcripts in the GENCODE annotation is relatively small. Accurate prediction of protein-coding genes in mammals remains a challenging and active area of research [1]. In the past decade the most important advance in de novo gene prediction came with the initial availability of extensive human and mouse genomic sequences. Several gene prediction algorithms were introduced at that time that improved gene prediction by using the specific patterns of evolutionary conservation that are indicative of protein coding genes [2-4]. All of the dual-genome (category 4) gene finders participating in EGASP rely on alignments to one or more informant genome sequences. For predicting human genes, dual-genome gene prediction algorithms most often use the mouse genome sequence as a source of evolutionary conservation information. This was originally a consequence of the early availability, with respect to other mammals, of the mouse genome sequence [5-8].
However, as additional genomes were sequenced, it became apparent that the evolutionary divergence between human and
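The MARS abstract above says only that transcript predictions are combined into genes "based on coding (CDS) region overlap". A minimal sketch of one way such grouping could work is given below. The single-linkage rule (any shared CDS overlap joins two transcripts into one gene), the half-open coordinate convention, and the names `cds_overlap` and `group_into_genes` are assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, List, Tuple

# Assumption: a CDS segment is a half-open interval [start, end) on one strand.
Interval = Tuple[int, int]


def cds_overlap(a: List[Interval], b: List[Interval]) -> bool:
    """Return True if any CDS segment of a overlaps any CDS segment of b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def group_into_genes(transcripts: List[List[Interval]]) -> List[List[int]]:
    """Group transcript indices so that CDS-overlapping transcripts share a gene."""
    parent = list(range(len(transcripts)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is quadratic; acceptable for a small illustration.
    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if cds_overlap(transcripts[i], transcripts[j]):
                parent[find(i)] = find(j)

    genes: Dict[int, List[int]] = {}
    for i in range(len(transcripts)):
        genes.setdefault(find(i), []).append(i)
    return sorted(genes.values())


if __name__ == "__main__":
    predictions = [
        [(100, 200), (300, 400)],  # transcript 0
        [(350, 450)],              # transcript 1, overlaps transcript 0
        [(1000, 1100)],            # transcript 2, separate locus
    ]
    print(group_into_genes(predictions))
```

With this rule, two predictions from different informant genomes that share coding sequence end up as alternative transcripts of one gene, which is the behaviour the abstract describes at a high level.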
De novo reconstruction of the Toxoplasma gondii transcriptome improves on the current genome annotation and reveals alternatively spliced transcripts and putative long non-coding RNAs
Hassan Musa A, Melo Mariane B, Haas Brian, Jensen Kirk D C
BMC Genomics , 2012, DOI: 10.1186/1471-2164-13-696
Abstract: Background: Accurate gene model predictions and annotation of alternative splicing events are imperative for genomic studies in organisms that contain genes with multiple exons. Currently most gene models for the intracellular parasite, Toxoplasma gondii, are based on computer model predictions without cDNA sequence verification. Additionally, the nature and extent of alternative splicing in Toxoplasma gondii is unknown. In this study, we used de novo transcript assembly and the published type II (ME49) genomic sequence to quantify the extent of alternative splicing in Toxoplasma and to improve the current Toxoplasma gene annotations. Results: We used high-throughput RNA-sequencing data to assemble full-length transcripts, independently of a reference genome, followed by gene annotation based on the ME49 genome. We assembled 13,533 transcripts overlapping with known ME49 genes in ToxoDB and then used this set to: a) improve the annotation in the untranslated regions of ToxoDB genes, b) identify novel exons within protein-coding ToxoDB genes, and c) report on 50 previously unidentified alternatively spliced transcripts. Additionally, we assembled a set of 2,930 transcripts not overlapping with any known ME49 genes in ToxoDB. From this set, we have identified 118 new ME49 genes, 18 novel Toxoplasma genes, and putative non-coding RNAs. Conclusion: RNA-seq data and de novo transcript assembly provide a robust way to update incompletely annotated genomes, like the Toxoplasma genome. We have used RNA-seq to improve the annotation of several Toxoplasma genes, identify alternatively spliced genes, novel genes, novel exons, and putative non-coding RNAs.
Exogean: a framework for annotating protein-coding genes in eukaryotic genomic DNA
Sarah Djebali, Franck Delaplace, Hugues Crollius
Genome Biology , 2006, DOI: 10.1186/gb-2006-7-s1-s7
Abstract: We have developed Exogean, a flexible framework based on directed acyclic colored multigraphs (DACMs) that can represent biological objects (for example, mRNA, ESTs, protein alignments, exons) and relationships between them. Graphs are analyzed to process the information according to rules that replicate those used by human annotators. Simple individual starting objects given as input to Exogean are thus combined and synthesized into complex objects such as protein coding transcripts. We show here, in the context of the EGASP project, that Exogean is currently the method that best reproduces protein coding gene annotations from human experts, in terms of identifying at least one exact coding sequence per gene. We discuss current limitations of the method and several avenues for improvement. Ideally, the process of annotating protein coding genes (hereafter referred to as 'genes') in a region of genomic DNA involves locating the exact external and internal boundaries of all the genes it includes and, for each, finding all the possible transcript variants. In practice, achieving this is very difficult in eukaryotic genomes for many reasons. First, eukaryotic genes are generally composed of a succession of exons and introns, which makes their structure complex and highly variable. Second, genes cover only a small fraction of eukaryotic genomes (30% in mammals) and exons cover an even lower fraction (1% to 2% in mammals). Third, some eukaryotic genomes contain many pseudogenes, which are non-functional copies of genes sometimes nested within genes and with similar compositions. Finally, each gene may give rise to many different transcripts, often with minor variations, a mechanism that modulates the function or the spatial or temporal availability of the corresponding protein.
Despite these difficulties, precise gene annotation is crucial for biomedical research: it is a basic requirement to link genotype and phenotypes in human and model species and generally to focus the w
Automatic extraction of reliable regions from multiple sequence alignments
Lassmann Timo, Sonnhammer Erik LL
BMC Bioinformatics , 2007, DOI: 10.1186/1471-2105-8-s5-s9
Abstract: Background: High quality multiple alignments are crucial in the transfer of annotation from one genome to another. Multiple alignment methods strive to achieve ever increasing levels of average accuracy on benchmark sets while the accuracy of individual alignments is often overlooked. Results: We have previously developed a method to automatically assess the accuracy and overall difficulty of multiple alignments. This was achieved by a per-residue comparison between alternate alignments of the same sequences. Here we present a key extension to this method, an algorithm to extract similarly aligned regions from several alignments and merge them into a new consensus alignment. Conclusion: We demonstrate that the fraction of correctly aligned residues within the resulting alignments is increased by 25–100 percent compared to the original input alignments, as only the most reliably aligned parts are considered.
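The idea of comparing alternate alignments residue by residue and keeping only the parts on which they agree can be illustrated with a small sketch. It is restricted to pairwise alignments written as gapped strings, and it keeps a residue pair only when every alternate alignment contains it. The strict intersection rule and the function names are assumptions for illustration; the abstract does not specify how agreement is scored or how regions are merged.

```python
from typing import List, Set, Tuple


def aligned_pairs(row_a: str, row_b: str) -> Set[Tuple[int, int]]:
    """Residue index pairs that share a column in one pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs: Set[Tuple[int, int]] = set()
    i = j = 0  # ungapped positions in sequence a and sequence b
    for ca, cb in zip(row_a, row_b):
        if ca != "-" and cb != "-":
            pairs.add((i, j))
        if ca != "-":
            i += 1
        if cb != "-":
            j += 1
    return pairs


def consistent_pairs(alignments: List[Tuple[str, str]]) -> Set[Tuple[int, int]]:
    """Residue pairs on which every alternate alignment agrees."""
    if not alignments:
        return set()
    pair_sets = [aligned_pairs(a, b) for a, b in alignments]
    return set.intersection(*pair_sets)


if __name__ == "__main__":
    # Two alternate alignments of the same sequences, ACGT and AGT.
    alternates = [("ACGT", "A-GT"), ("ACGT", "AG-T")]
    print(sorted(consistent_pairs(alternates)))
```

In the example the two alignments disagree on where the G of the second sequence belongs, so only the flanking A and T pairs survive, which mirrors the abstract's point that only the most reliably aligned parts are kept.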
Increasing Sequence Search Sensitivity with Transitive Alignments
Ketil Malde, Tomasz Furmanek
PLOS ONE , 2013, DOI: 10.1371/journal.pone.0054422
Abstract: Sequence alignment is an important bioinformatics tool for identifying homology, but searching against the full set of available sequences is likely to result in many hits to poorly annotated sequences providing very little information. Consequently, we often want alignments against a specific subset of sequences: for instance, we are looking for sequences from a particular species, sequences that have known 3D structures, sequences that have a reliable (curated) function annotation, and so on. Although such subset databases are readily available, they only represent a small fraction of all sequences. Thus, the likelihood of finding close homologs for query sequences is smaller, and the alignments will in general have lower scores. This makes it difficult to distinguish hits to homologous sequences from random hits to unrelated sequences. Here, we propose a method that addresses this problem by first aligning query sequences against a large database representing the corpus of known sequences, and then constructing indirect (or transitive) alignments by combining the results with alignments from the large database against the desired target database. We compare the results to direct pairwise alignments, and show that our method gives higher sensitivity alignments against the target database.
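The core of a transitive alignment, combining a query-to-intermediate alignment with an intermediate-to-target alignment, can be sketched as a composition of position maps. The representation below (each alignment as a dictionary from positions in one sequence to positions in the other) and the function name `compose` are assumptions for illustration. How the published method scores or selects the combined alignments is not described in the abstract and is not modelled here.

```python
from typing import Dict


def compose(query_to_mid: Dict[int, int], mid_to_target: Dict[int, int]) -> Dict[int, int]:
    """Map query positions to target positions through a shared intermediate.

    A query position is kept only if the intermediate position it aligns to
    is itself aligned to some target position.
    """
    return {
        q: mid_to_target[m]
        for q, m in query_to_mid.items()
        if m in mid_to_target
    }


if __name__ == "__main__":
    # Hypothetical alignments: query vs. a sequence from the large database,
    # and that same sequence vs. a sequence in the curated target database.
    query_to_mid = {0: 10, 1: 11, 2: 12, 3: 14}
    mid_to_target = {11: 5, 12: 6, 13: 7}
    print(compose(query_to_mid, mid_to_target))
```

Only the positions covered by both alignments carry over, so the indirect alignment is never longer than the shorter of the two direct ones.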
Gene recognition via spliced alignment
Todd Richmond
Genome Biology , 2000, DOI: 10.1186/gb-2000-1-1-reports233
Abstract: It is somewhat difficult to find the basic page for submitting sequences (Gene recognition via spliced alignment). The main page contains reference information and a simple explanation of how the program works. Once you locate the basic submission page, however, you can bookmark it separately. You can submit a genomic sequence up to 180,000 base pairs long and a maximum of 10 related protein sequences. There are only a few options to worry about. You can choose some parameters that the program uses for aligning the related proteins with the predicted protein, and select the minimum intron size you expect. You can also choose to specify whether or not you believe that the sequence being analysed contains a full gene, or one that is incomplete at either the 5' or 3' end. You can also specify the organism, though the choices are currently limited to human and mammalian, Drosophila, monocot plants, dicot plants or yeast. The site warns, however, that only the parameters for human and mammalian sequence have been extensively tested and optimized. Last updated 2 January 1997. The ability to use a related sequence to determine the gene structure for an unknown gene is a powerful tool. Even distantly related proteins can be extremely useful in predicting exons in unknown sequence. The program outputs a combined graphic showing the predicted gene structures from all related proteins submitted, as well as a separate table of exons, sequence alignments, and predicted protein sequence for each related sequence, with a confidence score for each related sequence. PROCRUSTES uses a very strict definition for splice sites, which can cause problems. The set of candidate exons is constructed by selection of all blocks between candidate acceptor and donor sites (that is, between an AG dinucleotide at an intron-exon boundary and a GU dinucleotide at an exon-intron boundary).
As a result, if there are any deviations from this, the program will either fail to find the correct exons, or defi
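The candidate-exon rule described in this entry (every block between an AG acceptor and a GU donor) can be sketched as a simple enumeration. The sketch works on genomic DNA, where the GU donor appears as GT. The function name `candidate_exons`, the `min_len` parameter, and the half-open coordinates are assumptions for illustration, and nothing here reproduces the filtering or protein alignment that PROCRUSTES applies afterwards.

```python
from typing import List, Tuple


def candidate_exons(seq: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Enumerate half-open blocks [start, end) that follow an AG and precede a GT."""
    seq = seq.upper()
    # An exon may start immediately after an acceptor AG dinucleotide.
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    # An exon may end immediately before a donor GT (GU in the transcript).
    ends = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if e - s >= min_len]


if __name__ == "__main__":
    genomic = "ccAGATGGCGTaa"
    for start, end in candidate_exons(genomic):
        print(start, end, genomic[start:end])
```

Because only exact AG and GT dinucleotides are accepted, an exon flanked by a non-canonical splice site never enters the candidate set, which is the limitation the review points out.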

Copyright © 2008-2017 Open Access Library. All rights reserved.