Gene Similarity-based Approaches for Determining Core-Genes of Chloroplasts
Bassam AlKindy, Christophe Guyeux, Jean-François Couchot, Michel Salomon, Jacques M. Bahi
Computer Science, 2014
Abstract: In computational biology and bioinformatics, understanding evolutionary processes across related organisms has received a lot of attention in recent decades. However, accurate methodologies are still needed to trace the evolution of gene content. In a previous work, two novel approaches based on sequence similarities and gene features were proposed. More precisely, we proposed to use gene names, sequence similarities, or both, obtained from either the NCBI or the DOGMA annotation tool. DOGMA has the advantage of being an up-to-date, accurate automatic tool specifically designed for chloroplasts, whereas NCBI provides high-quality human-curated genes (together with wrongly annotated ones). The key idea of the former proposal was to take the best from these two tools. However, that first proposal was limited by name variations and spelling errors on the NCBI side, leading to core trees of low quality. In this paper, these flaws are fixed by improving the comparison of NCBI and DOGMA results, and by relaxing constraints on gene names while adding a post-validation stage on gene sequences. The two stages of similarity measures, on names and on sequences, are thus proposed for sequence clustering. This improves on the results that can be obtained using either NCBI or DOGMA alone. Results obtained with this quality control test are further investigated and compared with previously released ones, on both computational and biological aspects, considering a set of 99 chloroplastic genomes.
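The two-stage idea in the abstract above (relaxed matching on gene names, then post-validation on sequences) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' pipeline: the `normalize` rule, the `core_genes` helper, the 0.8 threshold, and the toy genomes are hypothetical, and `difflib.SequenceMatcher` stands in for a proper sequence-alignment score.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Relaxed name matching: lowercase and drop non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def core_genes(genomes, threshold=0.8):
    """Return normalized gene names present in every genome whose sequences
    also pass a similarity check against the first genome's copy.

    genomes: dict mapping genome id -> dict mapping gene name -> sequence.
    threshold: hypothetical cut-off on the SequenceMatcher ratio.
    """
    gene_maps = [
        {normalize(name): seq for name, seq in genes.items()}
        for genes in genomes.values()
    ]
    # Stage 1: names shared by all genomes after normalization.
    shared = set(gene_maps[0])
    for gene_map in gene_maps[1:]:
        shared &= set(gene_map)
    # Stage 2: post-validation on sequences (stand-in similarity measure).
    core = []
    for name in sorted(shared):
        reference = gene_maps[0][name]
        if all(
            SequenceMatcher(None, reference, gene_map[name]).ratio() >= threshold
            for gene_map in gene_maps[1:]
        ):
            core.append(name)
    return core


# Toy data: names vary in case and punctuation; one psbA copy is mis-annotated.
toy = {
    "genome_A": {"rbcL": "ATGTCACCACAAACAGAG", "psbA": "ATGACTGCAATTTTAGAG", "ycf1": "ATGAAA"},
    "genome_B": {"RBCL": "ATGTCACCACAAACAGAG", "psb-A": "ATGACTGCAATTTTAGAA", "matK": "ATGGAG"},
    "genome_C": {"rbcl": "ATGTCACCACAAACTGAG", "PsbA": "CCCCCCCCCCCCCCCCCC", "ycf1": "ATGAAA"},
}
print(core_genes(toy))
```

In this toy input, `psbA` survives the name stage but is rejected by the sequence stage because one copy is unrelated, which is the kind of annotation error the post-validation step is meant to catch.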
Improved Core Genes Prediction for Constructing well-supported Phylogenetic Trees in large sets of Plant Species
Bassam AlKindy, Huda Al-Nayyef, Christophe Guyeux, Jean-François Couchot, Michel Salomon, Jacques M. Bahi
Computer Science, 2015, DOI: 10.1007/978-3-319-16483-0_38
Abstract: Inferring well-supported phylogenetic trees that precisely reflect the evolutionary process is a challenging task that depends entirely on how the related core genes have been found. In previous computational biology studies, many similarity-based algorithms, mainly relying on the computation of sequence alignment matrices, have been proposed to find them. In these kinds of approaches, a significantly high similarity score between two coding sequences extracted from a given annotation tool is taken to mean that they are the same gene. In a previous article, we presented a quality test approach (QTA) that improves the quality of core genes by combining two annotation tools (namely NCBI, a partially human-curated database, and DOGMA, an efficient annotation algorithm for chloroplasts). This method takes advantage of both sequence similarity and gene features to guarantee that the core genome contains correct and well-clustered coding sequences (i.e., genes). We then show in this article how useful such well-defined core genes are for biomolecular phylogenetic reconstructions, by investigating various subsets of core genes at various family or genus levels, leading to subtrees with strong bootstraps that are finally merged into a well-supported supertree.
Finding Communities of Related Genes
Dennis Wilkinson, Bernardo A. Huberman
Physics, 2002
Abstract: We present an automated method of identifying communities of functionally related genes from the biomedical literature. These communities encapsulate human gene and protein interactions and identify groups of genes that are complementary in their function. We use graphs to represent the network of gene co-occurrences in articles mentioning particular keywords, and find that these graphs consist of one giant connected component and many small ones. In addition, the vertex degree distribution of the graphs follows a power law, whose exponent we determine. We then use an algorithm based on betweenness centrality to identify community structures within the giant component. The different structures are then aggregated into a final list of communities, whose members are weighted according to how strongly they belong to them. Our method is efficient enough to be applicable to the entire Medline database, and yet the information it extracts is significantly detailed, applicable to a particular problem, and interesting in and of itself. We illustrate the method in the case of colon cancer and demonstrate important features of the resulting communities.
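The graph-construction step described in this abstract can be sketched as follows. This is only an illustrative sketch under stated assumptions: the helper names `cooccurrence_graph` and `connected_components` and the toy article list are hypothetical, and the betweenness-based community detection itself is not shown.

```python
from collections import defaultdict, deque
from itertools import combinations


def cooccurrence_graph(articles):
    """Build an undirected graph linking genes mentioned in the same article.

    articles: iterable of gene-name collections, one per article.
    """
    graph = defaultdict(set)
    for genes in articles:
        unique = sorted(set(genes))
        for gene in unique:
            graph[gene]  # register the vertex even if it has no neighbours
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return connected components (sorted vertex lists), largest first."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return sorted(components, key=len, reverse=True)


# Toy input: each inner list holds the genes mentioned in one article.
articles = [
    ["APC", "KRAS"],
    ["KRAS", "TP53"],
    ["APC", "TP53", "SMAD4"],
    ["MLH1", "MSH2"],
    ["BRAF"],
]
graph = cooccurrence_graph(articles)
components = connected_components(graph)
degrees = {gene: len(neighbours) for gene, neighbours in graph.items()}
print(components[0], degrees)
```

On real literature data, the largest entry of `components` would play the role of the giant component, and the values in `degrees` are what one would histogram to examine the degree distribution.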
Finding flavor genes
Philippe Reymond
Genome Biology, 2000, DOI: 10.1186/gb-2000-1-2-reports0057
Abstract: Aharoni et al. randomly isolated 1,701 cDNA clones from a strawberry fruit cDNA library and 480 clones from petunia corolla (as control) and printed the PCR-amplified clones on chemically modified glass slides using a robotic device. They used these microarrays to monitor changes in gene expression at three fruit developmental stages (from green to red). Using a rigorous statistical analysis, the authors found that 401 clones were differentially expressed between all three stages, with 177 clones being upregulated between the green and red stages. Sequences of the latter group of genes revealed that more than 50% were related to primary and secondary metabolism. From the other sequences potentially involved in flavor formation, Aharoni et al. identified a novel gene (SAAT) for an alcohol acetyltransferase, an enzyme that catalyzes the final step in the synthesis of volatile esters. This gene shows 16-fold greater expression during the red stage than the green stage of fruit development. The authors expressed recombinant SAAT in Escherichia coli and confirmed that the enzyme has alcohol acetyltransferase activity. Analysis of a series of potential substrates suggests that SAAT is responsible for formation of the predominant esters found in ripe strawberries. Access to Arabidopsis cDNA microarrays is provided by the Arabidopsis Functional Genomics Consortium (AFGC). Links to information on plant microarrays can also be found via the Virtual library: plant-arrays. Large-scale cDNA microarrays are now used with model systems to investigate global patterns of gene expression at the level of the whole organism. The utility of microarrays that cover a restricted portion of the genome, like that described in this paper, will become increasingly recognized, however. This paper is a first example of the use of customized plant cDNA microarrays from a non-model system.
It provides a good example of how a small selected array can be used to study a particular developmental proces
Hybrid Genetic Algorithm and Lasso Test Approach for Inferring Well Supported Phylogenetic Trees based on Subsets of Chloroplastic Core Genes
Bassam AlKindy, Christophe Guyeux, Jean-François Couchot, Michel Salomon, Christian Parisod, Jacques M. Bahi
Computer Science, 2015
Abstract: The number of completely sequenced chloroplast genomes is increasing rapidly, making it possible to build large-scale phylogenetic trees of plant species. Considering a subset of close plant species defined according to their chloroplasts, the phylogenetic tree that can be inferred from their core genes is not necessarily well supported, due to the possible occurrence of "problematic" genes (i.e., homoplasy, incomplete lineage sorting, horizontal gene transfers, etc.) which may blur the phylogenetic signal. However, a trustworthy phylogenetic tree can still be obtained if the number of problematic genes is low, the problem being to determine the largest subset of core genes that produces the best-supported tree. To discard problematic genes, and given the overwhelming number of possible combinations, we propose a hybrid approach that embeds both genetic algorithms and statistical tests. Given a set of organisms, the result is a multi-stage pipeline for the production of well-supported phylogenetic trees. The proposal has been applied to several plant families, with encouraging results.
CoreGenes: A computational tool for identifying and cataloging "core" genes in a set of small genomes
Nikhat Zafar, Raja Mazumder, Donald Seto
BMC Bioinformatics, 2002, DOI: 10.1186/1471-2105-3-12
Abstract: CoreGenes is a global JAVA-based interactive data mining tool that identifies and catalogs a "core" set of genes from two to five small whole genomes simultaneously. CoreGenes performs hierarchical and iterative BLASTP analyses using one genome as a reference and another as a query. Subsequent query genomes are compared against each newly generated "consensus." These iterations lead to a matrix comprising related genes from this set of genomes, e.g., viruses, mitochondria and chloroplasts. Currently the software is limited to small genomes on the order of 330 kilobases or less. A computational tool CoreGenes has been developed to analyze small whole genomes globally. BLAST score-related and putatively essential "core" gene data are displayed as a table with links to GenBank for further data on the genes of interest. This web resource is available at http://pumpkins.ib3.gmu.edu:8080/CoreGenes or http://www.bif.atcc.org/CoreGenes. The development of genomics instrumentation, technology and methodology, as well as their integration and deployment in many fields of research, has evolved from producing manageable small streams of DNA sequence data to generating an inundating amount of DNA sequence and whole genome data. This massive amount of raw DNA sequence data can be described simply and aptly as a "tsunami" – a tremendous and unexpected wave. An unprecedented wave can be either overwhelming or overwhelmed, depending upon the preparedness of investigators. Preparations include having available or developing appropriate computational tools. One particular area of continuing concern is the ability to separate interesting and relevant data from "noise."
This process is known as data mining and is enhanced by the development of effective and "user-friendly" bioinformatics tools and computational methods [1-10]. Many researchers have been interested in studying individual proteins, identifying single genes and characterizing putative genes, i.e., "open readi
Two Wheat Glutathione Peroxidase Genes Whose Products Are Located in Chloroplasts Improve Salt and H2O2 Tolerances in Arabidopsis
Chao-Zeng Zhai, Lei Zhao, Li-Juan Yin, Ming Chen, Qing-Yu Wang, Lian-Cheng Li, Zhao-Shi Xu, You-Zhi Ma
PLOS ONE, 2013, DOI: 10.1371/journal.pone.0073989
Abstract: Oxidative stress caused by accumulation of reactive oxygen species (ROS) can damage numerous cellular components. Glutathione peroxidases (GPXs) are key enzymes of the antioxidant network in plants. In this study, W69 and W106, two putative GPX genes, were obtained by de novo transcriptome sequencing of salt-treated wheat (Triticum aestivum) seedlings. The purified His-tag fusion proteins of W69 and W106 reduced H2O2 and t-butyl hydroperoxide (t-BHP) using glutathione (GSH) or thioredoxin (Trx) as an electron donor in vitro, showing their peroxidase activity toward H2O2 and toxic organic hydroperoxide. GFP fluorescence assays revealed that W69 and W106 are localized in chloroplasts. Quantitative real-time PCR (Q-RT-PCR) analysis showed that the two GPXs were differentially responsive to salt, drought, H2O2, or ABA. Isolation of the W69 and W106 promoters revealed some cis-acting elements responding to abiotic stresses. Overexpression of W69 and W106 conferred strong tolerance to salt, H2O2, and ABA treatment in Arabidopsis. Moreover, the expression levels of key regulator genes (SOS1, RbohD and ABI1/ABI2) involved in salt, H2O2 and ABA signaling were altered in the transgenic plants. These findings suggest that W69 and W106 not only act as scavengers of H2O2 in controlling abiotic stress responses, but also play important roles in salt and ABA signaling.
Knotty-Centrality: Finding the Connective Core of a Complex Network
Murray Shanahan, Mark Wildie
PLOS ONE, 2012, DOI: 10.1371/journal.pone.0036579
Abstract: A network measure called knotty-centrality is defined that quantifies the extent to which a given subset of a graph’s nodes constitutes a densely intra-connected topologically central connective core. Using this measure, the knotty centre of a network is defined as a sub-graph with maximal knotty-centrality. A heuristic algorithm for finding subsets of a network with high knotty-centrality is presented, and this is applied to previously published brain structural connectivity data for the cat and the human, as well as to a number of other networks. The cognitive implications of possessing a connective core with high knotty-centrality are briefly discussed.
Finding disease candidate genes by liquid association
Ker-Chau Li, Aarno Palotie, Shinsheng Yuan, Denis Bronnikov, Daniel Chen, Xuelian Wei, Oi-Wa Choi, Janna Saarela, Leena Peltonen
Genome Biology, 2007, DOI: 10.1186/gb-2007-8-10-r205
Abstract: Studies aiming to identify susceptibility genes in complex diseases have proceeded along two lines. The traditional candidate gene approach is limited by our ability to come up with a comprehensive list of biologically related genes. On the other hand, the 'hypothesis free' approach relies on genome-wide scans for disease loci, typically via linkage in exceptionally large families or via association in case control studies. Multiple sclerosis (MS), which is one of the most common neurologic disorders affecting young adults, is characterized by demyelination and reactive gliosis [1]. Analogous to many complex traits, genome scans in MS have identified numerous chromosomal loci often with only a nominal evidence for linkage to MS [2-6]. With the notable exception of the human leukocyte antigen (major histocompatibility complex [MHC]) locus on 6p21, evidence for specific MS genes emerging from these studies is still scanty. Thus far, the only associated non-HLA genes replicated in multiple populations are the PRKCA gene [7] and the recently reported IL2RA and IL7R genes [8]. For MS, as for most complex traits, the loci derived from linkage scans have remained quite wide because of multiple uncertainties concerning the disease model in statistical analyses. To expedite the process of gene identification in these wide DNA regions, we need novel approaches to identify potentially involved pathways and to prioritize genes on identified loci for further sequencing efforts. Our idea is to turn to full genome functional studies for these goals. As illustrated in Figure 1, our approach takes advantage of the availability of abundant microarray data and a wealth of genomic/proteomic knowledge base from the public domain. Our intention is to integrate information from both the candidate gene and the full genome scan (thus far mostly family-based linkage) approaches.
In this report we use two previously reported MS susceptibility genes, identified in the same study sample [7,9], n
An Integrated Approach for Finding Overlooked Genes in Shigella
Junping Peng, Jian Yang, Qi Jin
PLOS ONE, 2012, DOI: 10.1371/journal.pone.0018509
Abstract: The completion of numerous genome sequences introduced an era of whole-genome study. However, many genes are missed during genome annotation, including small RNAs (sRNAs) and small open reading frames (sORFs). In order to improve genome annotation, we aimed to identify novel sRNAs and sORFs in Shigella, the principal etiologic agents of bacillary dysentery.
