In this paper we present a new ab initio approach for constructing an unrooted dendrogram using protein clusters, an approach that has the potential for estimating relationships among several thousands of species based on their putative proteomes. We employ an open-source software program called pClust that was developed for use in metagenomic studies. Sequence alignment is performed by pClust using the Smith-Waterman algorithm, which is known to give optimal alignment and, hence, greater accuracy than BLAST-based methods. Protein clusters generated by pClust are used to create protein profiles for each species in the dendrogram, these profiles forming a correlation filter library for use with a new taxon. To augment the dendrogram with a new taxon, a protein profile for the taxon is created using BLASTp, and this new taxon is placed into a position within the dendrogram corresponding to the highest correlation with profiles in the correlation filter library. This work was initiated because of our interest in plasmids, and each step is illustrated using proteomes from Gram-negative bacterial plasmids. Proteomes for 527 plasmids were used to generate the dendrogram, and to demonstrate the utility of the insertion algorithm twelve recently sequenced pAKD plasmids were used to augment the dendrogram. 1. Introduction The availability of complete proteomes for hundreds of thousands of species provides an unprecedented opportunity to study genetic relationships among a large number of species. However, the necessary software tools for handling massive amounts of data must first be developed before we can exploit the availability of these proteomes. Currently the tools used for clustering either are restricted in terms of the number of proteomes that can be examined because of the time required to obtain results or else are restricted in terms of their sensitivity. For example, clustering by means of hidden markov models (HMM), multiple sequence alignment, and pairwise sequence alignment by means of the Smith-Waterman alignment algorithm are limited by their time complexity. The Smith-Waterman algorithm, a dynamic programming algorithm, is known to give optimal alignment between two protein sequences for a given similarity matrix [1], but alignment of two sequences of lengths and requires time. On the other hand, heuristic approximate alignment methods, frequently based on BLAST and its variants [2], reduce the computational time required; for example, in practice BLAST effectively reduces the time to , but this comes at the risk of losing sensitivity to
References
[1]
T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
[2]
S. F. Altschul, T. L. Madden, A. A. Sch?ffer et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
[3]
D. L. Brutlag, J.-P. Dautricourt, R. Diaz, J. Fier, B. Moxon, and R. Stamm, “BLAZE: an implementation of the Smith-Waterman sequence comparison algorithm on a massively parallel computer,” Computers and Chemistry, vol. 17, no. 2, pp. 203–207, 1993.
[4]
E. G. Shpaer, M. Robinson, D. Yee, J. D. Candlin, R. Mines, and T. Hunkapiller, “Sensitivity and selectivity in protein similarity searches: a comparison of Smith-Waterman in hardware to BLAST and FASTA,” Genomics, vol. 38, no. 2, pp. 179–191, 1996.
[5]
C. Wu, A. Kalyanaraman, and W. R. Cannon, “PGraph: efficient parallel construction of large-scale protein sequence homology graphs,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 10, Article ID 6127863, pp. 1923–1933, 2012.
[6]
D. Gibson, R. Kumar, and A. Tomkins, “Discovering large dense subgraphs in massive graphs,” in Proceedings of the 31st International Conference on Very Large Data Bases, pp. 721–732, September 2005.
[7]
A. Kalyanaraman, S. Aluru, S. Kothari, and V. Brendel, “Efficient clustering of large EST data sets on parallel computers,” Nucleic Acids Research, vol. 31, no. 11, pp. 2963–2974, 2003.
[8]
E. Bapteste, Y. Boucher, J. Leigh, and W. F. Doolittle, “Phylogenetic reconstruction and lateral gene transfer,” Trends in Microbiology, vol. 12, no. 9, pp. 406–411, 2004.
[9]
E. Fidelma Boyd, C. W. Hill, S. M. Rich, and D. L. Hard, “Mosaic structure of plasmids from natural populations of Escherichia coli,” Genetics, vol. 143, no. 3, pp. 1091–1100, 1996.
[10]
H. Ochman, J. G. Lawrence, and E. A. Grolsman, “Lateral gene transfer and the nature of bacterial innovation,” Nature, vol. 405, no. 6784, pp. 299–304, 2000.
[11]
C. M. Thomas, “Paradigms of plasmid organization,” Molecular Microbiology, vol. 37, no. 3, pp. 485–491, 2000.
[12]
C. M. Thomas and K. M. Nielsen, “Mechanisms of, and barriers to, horizontal gene transfer between bacteria,” Nature Reviews Microbiology, vol. 3, no. 9, pp. 711–721, 2005.
[13]
M. Couturier, F. Bex, P. L. Bergquist, and W. K. Maas, “Identification and classification of bacterial plasmids,” Microbiological Reviews, vol. 52, no. 3, pp. 375–395, 1988.
[14]
J. J. Dennis, “The evolution of IncP catabolic plasmids,” Current Opinion in Biotechnology, vol. 16, no. 3, pp. 291–298, 2005.
[15]
J. Huang and J. P. Gogarten, “Ancient horizontal gene transfer can benefit phylogenetic reconstruction,” Trends in Genetics, vol. 22, no. 7, pp. 361–366, 2006.
[16]
S. Karlin and C. Burge, “Dinucleotide relative abundance extremes: a genomic signature,” Trends in Genetics, vol. 11, no. 7, pp. 283–290, 1995.
[17]
S. Karlin, “Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes,” Trends in Microbiology, vol. 9, no. 7, pp. 335–343, 2001.
[18]
M. Brilli, A. Mengoni, M. Fondi, M. Bazzicalupo, P. Liò, and R. Fani, “Analysis of plasmid genes by phylogenetic profiling and visualization of homology relationships using Blast2Network,” BMC Bioinformatics, vol. 9, article 551, 2008.
[19]
S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.
[20]
D. Sen, G. A. Van der Auwera, L. M. Rogers, C. M. Thomas, C. J. Brown, and E. M. Top, “Broad-host-range plasmids from agricultural soils have IncP-1 backbones with diverse accessory genes,” Applied and Environmental Microbiology, vol. 77, pp. 7975–7983, 2011.
[21]
Y. Zhou, D. R. Call, and S. L. Broschat, “Genetic relationships among 527 Gram-negative bacterial plasmids,” Plasmid, vol. 68, no. 2, pp. 133–141, 2012.
[22]
D. R. Call, R. S. Singer, D. Meng et al., “blaCMY-2-positive IncA/C plasmids from Escherichia coli and Salmonella enterica are a distinct component of a larger lineage of plasmids,” Antimicrobial Agents and Chemotherapy, vol. 54, no. 2, pp. 590–596, 2010.
[23]
J. L. Rodgers and W. A. Nicewander, “Thirteen ways to look at the correlation coefficient,” The American Statistician, vol. 42, pp. 59–66, 1988.
[24]
M. S. Stigler, “Francis Galton's account of the invention of correlation,” Statistical Science, vol. 4, pp. 73–79, 1989.
[25]
K. Tamura, D. Peterson, N. Peterson, G. Stecher, M. Nei, and S. Kumar, “MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods,” Molecular Biology and Evolution, vol. 28, no. 10, pp. 2731–2739, 2011.
[26]
M. Lescot, S. Audic, C. Robert et al., “The genome of Borrelia recurrentis, the agent of deadly louse-borne relapsing fever, is a degraded subset of tick-borne Borrelia duttonii,” PLoS Genetics, vol. 4, no. 9, Article ID e1000185, 2008.
[27]
J. E. Purser and S. J. Norris, “Correlation between plasmid content and infectivity in Borrelia burgdorferi,” Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 25, pp. 13865–13870, 2000.
[28]
P. Norberg, M. Bergstrom, V. Jethava, D. Dubhashi, and M. Hermansson, “The IncP-1 plasmid backbone adapts to different host bacterial species and evolves through homologous recombination,” Nature Communications, vol. 2, article 268, 2011.
[29]
A. Buda and A. Jarynowski, “Life-time of correlations and its applications,” Wydawnictwo Niezalezne, vol. 1, pp. 5–21, 2010.
[30]
J. Cohen, Statistical Power Analysis For the Behavioral Sciences, Law-rence Erlbaum Associates, Hillsdale, NJ, USA, 2nd edition, 1988.