Search Results: 1 - 10 of 19237 matches for "Richard Durbin"
Vertebrate gene finding from multiple-species alignments using a two-level strategy
David Carter, Richard Durbin
Genome Biology , 2006, DOI: 10.1186/gb-2006-7-s1-s6
Abstract: We describe DOGFISH, a vertebrate gene finder consisting of a cleanly separated site classifier and structure predictor. The classifier scores potential splice sites and other features, using sequence alignments between multiple vertebrate species, while the structure predictor hypothesizes coding transcripts by combining these scores using a simple model of gene structure. This also identifies and assigns confidence scores to possible additional exons. Performance is assessed on the ENCODE regions. We predict transcripts and exons across the whole human genome, and identify over 10,000 high confidence new coding exons not in the Ensembl gene set. We present a practical multiple species gene prediction method. Accuracy improves as additional species, up to at least eight, are introduced. The novel predictions of the whole-genome scan should support efficient experimental verification. Gene finding can usefully be viewed as a two-level task. At the lower or local level there is a classification task: one of assigning probability estimates to potential features such as splice sites and coding start and stop sites on the basis of sequence information associated with each potential feature. At the higher or global level, on the other hand, we have a structure-building task: finding the most probable way(s) to combine potential features into exons, transcripts and genes. Classification and structure building are very different tasks, and although a gene finder can be based on a single formalism, such as hidden Markov models (HMMs) [1,2], there is no reason to assume that the same technique will be optimal for both tasks.
Although HMMs seem to offer a good basis for structure building, they impose independence assumptions that are not particularly well suited to feature classification; formalisms such as neural networks [3,4], maximum entropy modeling [5], Bayesian networks [6-8], support vector machines [9-11] and relevance vector machines (RVMs) [12-14] provide alternativ
Enhanced protein domain discovery using taxonomy
Lachlan Coin, Alex Bateman, Richard Durbin
BMC Bioinformatics , 2004, DOI: 10.1186/1471-2105-5-56
Abstract: We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques. Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions – such as domain co-occurrence and protein localisation. Protein domains are the structural, functional and evolutionary units of proteins. Several statistical techniques are currently used for detecting protein domains. In particular, Profile hidden Markov models (profile HMMs) have been successfully applied to this problem [1,2], and form the basis for databases such as Pfam [3]. Profile HMMs can be more sensitive than methods which look for pairwise homology [4]. Our ability to detect distant homology is limited by noise. This is due to the divergence of the amino acid sequence too far away from the profile to detect the similarity, despite the preservation of structure and function. We attempt to take into account extra information concerning the patterns of occurrence of domains in order to recognize distant homology. We have previously discovered that using the probabilities of domains occurring together in a sequence as contextual information significantly enhances domain detection [5]. In this paper we investigate using the species distribution of domains to enhance detection. Fig. 1 shows examples of domains which have biased taxonomic distribution.
For exampl
[X]uniqMAP: unique gene sequence regions in the human and mouse genomes
José L Jiménez, Richard Durbin
BMC Genomics , 2006, DOI: 10.1186/1471-2164-7-249
Abstract: Taking advantage of the availability of complete genome sequence information for mouse and human, the most widely used systems for the study of mammalian genetics, we have built a database, [X]uniqMAP, that stores the precalculated unique regions for all transcripts of these two organisms. For each gene, the database discriminates between those unique regions that are shared by all transcripts and those exclusive to single transcripts. In addition, it also provides those unique regions that are shared between orthologous genes from the two organisms. The database is updated regularly to reflect changes in genome assemblies and gene builds. Over 85% of genes have unique regions at least 19 bases long, with the majority being unique over 60% of their lengths. 14482 human genes share at least one identical unique region with mouse genes, though such regions are typically under 40 bases long. The full data are publicly accessible online both interactively and for download. They should facilitate (i) the design of probes, primers and siRNAs for both small- and large-scale projects; and (ii) the identification of regions for the design of oligos that could be re-used to target equivalent gene/transcripts from human and mouse. Following the completion of several whole genome sequencing projects, a considerable effort has been focused on genome-wide functional analyses of a number of organisms (reviewed in [1]). Some of the most popular methods are the study of gene expression by microarrays and phenotypic analyses from gene knock-downs by means of RNA interference techniques [2,3]. The success of these methods relies on the ability of reagent oligonucleotides to specifically recognise single species of transcripts within the complex mixture present in the studied cells.
Therefore, when designing probes, primers and siRNAs, the sequence specificity of candidate oligonucleotides must be assessed in order to minimise potential cross-hybridisations and off-target effects [4,5]. Altho
A Bayesian Framework to Account for Complex Non-Genetic Factors in Gene Expression Levels Greatly Increases Power in eQTL Studies
Oliver Stegle, Leopold Parts, Richard Durbin, John Winn
PLOS Computational Biology , 2010, DOI: 10.1371/journal.pcbi.1000770
Abstract: Gene expression measurements are influenced by a wide range of factors, such as the state of the cell, experimental conditions and variants in the sequence of regulatory regions. To understand the effect of a variable of interest, such as the genotype of a locus, it is important to account for variation that is due to confounding causes. Here, we present VBQTL, a probabilistic approach for mapping expression quantitative trait loci (eQTLs) that jointly models contributions from genotype as well as known and hidden confounding factors. VBQTL is implemented within an efficient and flexible inference framework, making it fast and tractable on large-scale problems. We compare the performance of VBQTL with alternative methods for dealing with confounding variability on eQTL mapping datasets from simulations, yeast, mouse, and human. Employing Bayesian complexity control and joint modelling is shown to result in more precise estimates of the contribution of different confounding factors resulting in additional associations to measured transcript levels compared to alternative approaches. We present a threefold larger collection of cis eQTLs than previously found in a whole-genome eQTL scan of an outbred human population. Altogether, 27% of the tested probes show a significant genetic association in cis, and we validate that the additional eQTLs are likely to be real by replicating them in different sets of individuals. Our method is the next step in the analysis of high-dimensional phenotype data, and its application has revealed insights into genetic regulation of gene expression by demonstrating more abundant cis-acting eQTLs in human than previously shown. Our software is freely available online at http://www.sanger.ac.uk/resources/software/peer/.
Clustering of phosphorylation site recognition motifs can be exploited to predict the targets of cyclin-dependent kinase
Alan M Moses, Jean-Karim Hériché, Richard Durbin
Genome Biology , 2007, DOI: 10.1186/gb-2007-8-2-r23
Abstract: Protein kinases are ubiquitous components of cellular signalling networks [1]. A relatively well understood example is the network that controls progression of the cell cycle, where cyclin-dependent kinases (CDKs) couple with various cyclins over the cell cycle to regulate critical processes [2-4]. Despite their biological and medical importance, relatively few direct, in vivo targets of these kinases have been identified conclusively, because experimental techniques are difficult and time consuming [1,5]. With the availability of databases of protein sequences, computational methods provide an alternative approach [6,7]. Kinase substrates often have short, degenerate sequence motifs surrounding the phosphorylated residue [8]. Putative target residues can be predicted by searching for matches to the consensus for a particular kinase. For example, CDK substrates often contain S/T-P-X-R/K where X represents any amino acid, and S/T represents the phosphorylated serine or threonine [9,10]. Because of the low specificity of the CDK consensus, however, databases of protein sequences are expected to contain large numbers of matches by chance. Therefore, many of the matches in protein sequences are likely to be false-positive predictions. Consistent with this, when 553 Saccharomyces cerevisiae proteins with at least one match to the CDK consensus were tested in a high-throughput kinase assay, only 32% (178) were found to be substrates [11]. Furthermore, in some cases characterized CDK substrates are phosphorylated at residues matching only a minimal consensus S/T-P [12]; considering these weak matches would probably lead to even larger numbers of false positives. Characterized CDK targets may be phosphorylated at multiple residues (for instance, see the report by Lees and coworkers [13]). Recent studies of several CDK target proteins in S. cerevisiae have shown that these multiple phosphorylations can regulate stability [12], protein interaction [14,15], or localization [16].
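The consensus search described in this abstract can be illustrated with a short Python sketch. It is not the authors' method (the paper's contribution is the clustering of such matches, which is not reproduced here); it only shows how the full consensus S/T-P-X-R/K and the minimal consensus S/T-P translate into pattern matches. The function name `find_sites` and the example sequence are hypothetical.

```python
import re

# Full CDK consensus S/T-P-X-R/K and minimal consensus S/T-P, as quoted in the
# abstract. The lookahead form reports every match, including overlapping ones.
FULL_CDK = re.compile(r"(?=[ST]P.[RK])")
MINIMAL_CDK = re.compile(r"(?=[ST]P)")


def find_sites(sequence, pattern):
    """Return 0-based positions of the candidate phosphorylated S/T residue."""
    return [m.start() for m in pattern.finditer(sequence)]


# Hypothetical peptide used only for illustration; not taken from the paper.
seq = "MSPAKTPLSPTRAASPQ"

full_sites = find_sites(seq, FULL_CDK)        # matches SPAK and SPTR
minimal_sites = find_sites(seq, MINIMAL_CDK)  # also matches TPLS and SPQ

print("full consensus sites:", full_sites)
print("minimal consensus sites:", minimal_sites)
```

On this toy sequence the minimal consensus matches twice as many positions as the full consensus, which mirrors the abstract's point that weaker motifs inflate the number of chance matches.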
Joint Genetic Analysis of Gene Expression Data with Inferred Cellular Phenotypes
Leopold Parts (equal contributor), Oliver Stegle (equal contributor), John Winn, Richard Durbin
PLOS Genetics , 2011, DOI: 10.1371/journal.pgen.1001276
Abstract: Even within a defined cell type, the expression level of a gene differs in individual samples. The effects of genotype, measured factors such as environmental conditions, and their interactions have been explored in recent studies. Methods have also been developed to identify unmeasured intermediate factors that coherently influence transcript levels of multiple genes. Here, we show how to bring these two approaches together and analyse genetic effects in the context of inferred determinants of gene expression. We use a sparse factor analysis model to infer hidden factors, which we treat as intermediate cellular phenotypes that in turn affect gene expression in a yeast dataset. We find that the inferred phenotypes are associated with locus genotypes and environmental conditions and can explain genetic associations to genes in trans. For the first time, we consider and find interactions between genotype and intermediate phenotypes inferred from gene expression levels, complementing and extending established results.
A systematic comparative and structural analysis of protein phosphorylation sites based on the mtcPTM database
José L Jiménez, Björn Hegemann, James RA Hutchins, Jan-Michael Peters, Richard Durbin
Genome Biology , 2007, DOI: 10.1186/gb-2007-8-5-r90
Abstract: In recent years, several sequencing projects have revealed the complete transcriptomes and proteomes for a number of organisms, including human [1,2]. The current challenge is to place this information within the dynamic context of the cell in order to elucidate how individual molecules interact to achieve the complex behavior of cellular processes, which translates into the ability of living organisms to adapt and thrive in a myriad of environments and conditions. Thus, much effort has been invested in identifying, for example, the transcription patterns of genes and the interacting partners of proteins in order to determine the connections that establish the intricate cellular pathways [3,4]. To understand these networks fully, however, we must also comprehend how their connections are regulated when the states of individual components are altered, for example by means of post-translational modifications (PTMs). It is therefore crucial to identify which proteins can be modified as well as the effect and lifetime of the PTMs. Among PTMs, reversible protein phosphorylation is known to play a key role in regulating a variety of processes in eukaryotes, from the cell division cycle to neuronal plasticity [5,6]. The most commonly observed phosphorylations affect serine, threonine, and tyrosine residues [7,8], although phosphorylation of histidines and aspartates has also been reported (for review, see [9]). Protein phosphorylation is catalyzed by enzymes called protein kinases, which are usually specific for either tyrosine or serine/threonine, with few of them being able to modify all three residues indistinguishably [10-12]. The human genome encodes 518 protein kinases [13,14], and recent estimates suggest that around one-third of cellular proteins could undergo phosphorylation [15].
Despite the progress made during the past few decades, our knowledge about regulation of protein function by phosphorylation and the basis of kinase specificity remains incomplete, mainly beca
The Sequence Ontology: a tool for the unification of genome annotations
Karen Eilbeck, Suzanna E Lewis, Christopher J Mungall, Mark Yandell, Lincoln Stein, Richard Durbin, Michael Ashburner
Genome Biology , 2005, DOI: 10.1186/gb-2005-6-5-r44
Abstract: Genomic annotations are the focal point of sequencing, bioinformatics analysis, and molecular biology. They are the means by which we attach what we know about a genome to its sequence. Unfortunately, biological terminology is notoriously ambiguous; the same word is often used to describe more than one thing and there are many dialects. For example, does a coding sequence (CDS) contain the stop codon or is the stop codon part of the 3'-untranslated region (3' UTR)? There really is no right or wrong answer to such questions, but consistency is crucial when attempting to compare annotations from different sources, or even when comparing annotations performed by the same group over an extended period of time. At present, GenBank [1] houses 220 viral genomes, 152 bacterial genomes, 20 eukaryotic genomes and 18 archaeal genomes. Other centers such as The Institute for Genomic Research (TIGR) [2] and the Joint Genome Institute (JGI) [3] also maintain and distribute annotations, as do many model organism databases such as FlyBase [4], WormBase [5], The Arabidopsis Information Resource (TAIR) [6] and the Saccharomyces Genome Database (SGD) [7]. Each of these groups has its own databases and many use their own data model to describe their annotations. There is no single place at which all sets of genome annotations can be found, and several sets are informally mirrored in multiple locations, leading to location-specific version differences. This can make it hazardous to exchange, combine and compare annotation data. Clearly, if genomic annotations were always described using the same language, then comparative analysis of the wealth of information distributed by these institutions would be enormously simplified: hence the Sequence Ontology (SO) project.
SO began 2 years ago, when a group of scientists and developers from the model organism databases - FlyBase, WormBase, Ensembl, SGD and MGI - came together to collect and unify the terms they used in their sequence annotation
A Genome-Wide Survey of Genetic Variation in Gorillas Using Reduced Representation Sequencing
Aylwyn Scally, Bryndis Yngvadottir, Yali Xue, Qasim Ayub, Richard Durbin, Chris Tyler-Smith
PLOS ONE , 2013, DOI: 10.1371/journal.pone.0065066
Abstract: All non-human great apes are endangered in the wild, and it is therefore important to gain an understanding of their demography and genetic diversity. Whole genome assembly projects have provided an invaluable foundation for understanding genetics in all four genera, but to date genetic studies of multiple individuals within great ape species have largely been confined to mitochondrial DNA and a small number of other loci. Here, we present a genome-wide survey of genetic variation in gorillas using a reduced representation sequencing approach, focusing on the two lowland subspecies. We identify 3,006,670 polymorphic sites in 14 individuals: 12 western lowland gorillas (Gorilla gorilla gorilla) and 2 eastern lowland gorillas (Gorilla beringei graueri). We find that the two species are genetically distinct, based on levels of heterozygosity and patterns of allele sharing. Focusing on the western lowland population, we observe evidence for population substructure, and a deficit of rare genetic variants suggesting a recent episode of population contraction. In western lowland gorillas, there is an elevation of variation towards telomeres and centromeres on the chromosomal scale. On a finer scale, we find substantial variation in genetic diversity, including a marked reduction close to the major histocompatibility locus, perhaps indicative of recent strong selection there. These findings suggest that despite their maintaining an overall level of genetic diversity equal to or greater than that of humans, population decline, perhaps associated with disease, has been a significant factor in recent and long-term pressures on wild gorilla populations.
YFitter: Maximum likelihood assignment of Y chromosome haplogroups from low-coverage sequence data
Luke Jostins, Yali Xu, Shane McCarthy, Qasim Ayub, Richard Durbin, Jeff Barrett, Chris Tyler-Smith
Quantitative Biology , 2014
Abstract: Low-coverage short-read resequencing experiments have the potential to expand our understanding of Y chromosome haplogroups. However, the uncertainty associated with these experiments means that haplogroups must be assigned probabilistically to avoid false inferences. We propose an efficient dynamic programming algorithm that can assign haplogroups by maximum likelihood, and represent the uncertainty in assignment. We apply this to both genotype and low-coverage sequencing data, and show that it can assign haplogroups accurately and with high resolution. The method is implemented as the program YFitter, which can be downloaded from http://sourceforge.net/projects/yfitter/