Multiple sequence alignment (MSA) of DNA, RNA, and protein sequences is one of the most essential techniques in the fields of molecular biology, computational biology, and bioinformatics. Next-generation sequencing technologies are changing the biology landscape, flooding the databases with massive amounts of raw sequence data. MSA of ever-increasing sequence data sets is becoming a significant bottleneck. In order to realise the promise of MSA for large-scale sequence data sets, it is necessary for existing MSA algorithms to be run in a parallelised fashion with the sequence data distributed over a computing cluster or server farm. Combining MSA algorithms with cloud computing technologies is therefore likely to improve the speed, quality, and capability for MSA to handle large numbers of sequences. In this review, multiple sequence alignments are discussed, with a specific focus on the ClustalW and Clustal Omega algorithms. Cloud computing technologies and concepts are outlined, and the next generation of cloud base MSA algorithms is introduced. 1. Introduction Multiple sequence alignments (MSA) are an essential and widely used computational procedure for biological sequence analysis in molecular biology, computational biology, and bioinformatics. MSA are completed where homologous sequences are compared in order to perform phylogenetic reconstruction, protein secondary and tertiary structure analysis, and protein function prediction analysis [1]. Biologically good and accurate alignments can have significant meaning, showing relationships and homology between different sequences, and can provide useful information, which can be used to further identify new members of protein families. The accuracy of MSA is of critical importance due to the fact that many bioinformatics techniques and procedures are dependent on MSA results [1]. Due to MSA significance, many MSA algorithms have been developed. Unfortunately, constructing accurate multiple sequence alignments is a computationally intense and biologically complex task, and as such, no current MSA tool is likely to generate a biologically perfect result. Therefore, this area of research is very active, aiming to develop a method which can align thousands of sequences that are lengthy and produce high-quality alignments and in a reasonable time [2, 3]. Alignment speed and computational complexity are negatively affected when the number of sequences to be aligned increases. The recent advances in high throughput sequencing technologies means that this sequence output is growing at an exponential rate, the
References
[1]
C. Kemena and C. Notredame, “Upcoming challenges for multiple sequence alignment methods in the high-throughput era,” Bioinformatics, vol. 25, no. 19, pp. 2455–2465, 2009.
[2]
R. C. Edgar and S. Batzoglou, “Multiple sequence alignment,” Current Opinion in Structural Biology, vol. 16, no. 3, pp. 368–373, 2006.
[3]
C. Notredame, “Recent evolutions of multiple sequence alignment algorithms,” PLoS Computational Biology, vol. 3, no. 8, article e123, 2007.
[4]
Human Genome Project Information, 2013, http://web.ornl.gov/sci/techresources/Human_Genome/home.shtml.
G. K. C. O. Scientists, “Genome 10K: a proposal to obtain whole-genome sequence for 10, 000 vertebrate species,” Journal of Heredity, vol. 100, no. 6, pp. 659–674, 2009.
[7]
454 Life Sciences, a Roche Company, 2013, http://www.454.com/.
H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing,” Briefings in Bioinformatics, vol. 11, no. 5, pp. 473–483, 2010.
C. B. Do and K. Katoh, “Protein multiple sequence alignment,” Methods in Molecular Biology, vol. 484, pp. 379–413, 2008.
[13]
R. C. Edgar, “MUSCLE: a multiple sequence alignment method with reduced time and space complexity,” BMC Bioinformatics, vol. 5, article 113, 2004.
[14]
S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
[15]
T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
[16]
I. M. Wallace, G. Blackshields, and D. G. Higgins, “Multiple sequence alignments,” Current Opinion in Structural Biology, vol. 15, no. 3, pp. 261–266, 2005.
[17]
K. Katoh and H. Toh, “Recent developments in the MAFFT multiple sequence alignment program,” Briefings in Bioinformatics, vol. 9, no. 4, pp. 286–298, 2008.
[18]
D.-F. Feng and R. F. Doolittle, “Progressive sequence alignment as a prerequisitetto correct phylogenetic trees,” Journal of Molecular Evolution, vol. 25, no. 4, pp. 351–360, 1987.
[19]
W. J. Wilbur and D. J. Lipman, “Rapid similarity searches of nucleic acid and protein data banks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 80, no. 3, pp. 726–730, 1983.
[20]
R. C. Edgar, “MUSCLE: multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Research, vol. 32, no. 5, pp. 1792–1797, 2004.
[21]
F. Sievers, A. Wilm, D. Dineen et al., “Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega,” Molecular Systems Biology, vol. 7, article 539, 2011.
[22]
N. Saitou and M. Nei, “The neighbor-joining method: a new method for reconstructing phylogenetic trees,” Molecular Biology and Evolution, vol. 4, no. 4, pp. 406–425, 1987.
[23]
I. Gronau and S. Moran, “Optimal implementations of UPGMA and other common clustering algorithms,” Information Processing Letters, vol. 104, no. 6, pp. 205–210, 2007.
[24]
J. D. Thompson, D. G. Higgins, and T. J. Gibson, “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, vol. 22, no. 22, pp. 4673–4680, 1994.
[25]
K. Katoh and D. M. Standley, “MAFFT multiple sequence alignment software version 7: improvements in performance and usability,” Molecular Biology and Evolution, vol. 30, no. 4, pp. 772–780, 2013.
[26]
T. Lassmann and E. L. L. Sonnhammer, “Kalign—an accurate and fast multiple sequence alignment algorithm,” BMC Bioinformatics, vol. 6, article 298, 2005.
[27]
U. Roshan and D. R. Livesay, “Probalign: multiple sequence alignment using partition function posterior probabilities,” Bioinformatics, vol. 22, no. 22, pp. 2715–2721, 2006.
[28]
B. Morgenstern, “DIALIGN: multiple DNA and protein sequence alignment at BiBiServ,” Nucleic Acids Research, vol. 32, supplement 2, pp. W33–W36, 2004.
[29]
A. L?ytynoja and N. Goldman, “Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis,” Science, vol. 320, no. 5883, pp. 1632–1635, 2008.
[30]
R. K. Bradley, A. Roberts, M. Smoot et al., “Fast statistical alignment,” PLoS Computational Biology, vol. 5, no. 5, Article ID e1000392, 2009.
[31]
P. Di Tommaso, S. Moretti, I. Xenarios et al., “T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension,” Nucleic Acids Research, vol. 39, supplement 2, pp. W13–W17, 2011.
[32]
C. Notredame, D. G. Higgins, and J. Heringa, “T-coffee: a novel method for fast and accurate multiple sequence alignment,” Journal of Molecular Biology, vol. 302, no. 1, pp. 205–217, 2000.
[33]
C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou, “ProbCons: probabilistic consistency-based multiple sequence alignment,” Genome Research, vol. 15, no. 2, pp. 330–340, 2005.
[34]
Y. Liu, B. Schmidt, and D. L. Maskell, “MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities,” Bioinformatics, vol. 26, no. 16, pp. 1958–1964, 2010.
[35]
D. W. Mount, “Using iterative methods for global multiple sequence alignment,” Cold Spring Harbor Protocols, vol. 4, no. 7, Article ID pdb.top44, 2009.
[36]
O. Gotoh, “Optimal alignment between groups of sequences and its application to multiple sequence alignment,” Computer Applications in the Biosciences, vol. 9, no. 3, pp. 361–370, 1993.
[37]
C. Notredame and D. G. Higgins, “SAGA: sequence alignment by genetic algorithm,” Nucleic Acids Research, vol. 24, no. 8, pp. 1515–1524, 1996.
[38]
J. D. Thompson, F. Plewniak, and O. Poch, “A comprehensive comparison of multiple sequence alignment programs,” Nucleic Acids Research, vol. 27, no. 13, pp. 2682–2690, 1999.
[39]
A. M. Lesk and C. Chothia, “How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins,” Journal of Molecular Biology, vol. 136, no. 3, pp. 225–270, 1980.
[40]
O. O'Sullivan, K. Suhre, C. Abergel, D. G. Higgins, and C. Notredame, “3DCoffee: combining protein sequences and structures within multiple sequence alignments,” Journal of Molecular Biology, vol. 340, no. 2, pp. 385–395, 2004.
[41]
F. Armougom, S. Moretti, O. Poirot et al., “Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee,” Nucleic Acids Research, vol. 34, supplement 2, pp. W604–W608, 2006.
[42]
X. Xia, S. Zhang, Y. Su, and Z. Sun, “MICAlign: a sequence-to-structure alignment tool integrating multiple sources of information in conditional random fields,” Bioinformatics, vol. 25, no. 11, pp. 1433–1434, 2009.
[43]
Z. Zhang, A. A. Sch?ffer, W. Miller et al., “Protein sequence similarity searches using patterns as seeds,” Nucleic Acids Research, vol. 26, no. 17, pp. 3986–3990, 1998.
[44]
M. C. Frith, N. F. W. Saunders, B. Kobe, and T. L. Bailey, “Discovering sequence motifs with arbitrary insertions and deletions,” PLoS Computational Biology, vol. 4, no. 4, Article ID e1000071, 2008.
[45]
H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[46]
R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignment program,” Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[47]
B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, article R25, 2009.
[48]
“A Definition of The Cloud at Last?—Web Performance Watch,” 2013, http://blogs.keynote.com/the_watch/2010/05/for-as-long-as-ive-been-involved-in-writing-about-the-cloud-and-its-related-applications-ive-seen-the-basic-question.html.
[49]
“Virtualization is a key enabler of cloud computing,” 2013, http://www.itnewsafrica.com/2010/03/virtualization-is-a-key-enabler-of-cloud-computing/.
[50]
D. Arthur and S. Vassilvitskii, “k-means++: the advantages of careful seeding,” in Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2007.
[51]
J. S?ding, “Protein homology detection by HMM-HMM comparison,” Bioinformatics, vol. 21, no. 7, pp. 951–960, 2005.
[52]
F. Sievers, D. Dineen, A. Wilm, and D. G. Higgins, “Making automated multiple alignments of very large numbers of protein sequences,” Bioinformatics, vol. 29, no. 8, pp. 989–995, 2013.
[53]
K. Katoh, K. Misawa, K.-I. Kuma, and T. Miyata, “MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform,” Nucleic Acids Research, vol. 30, no. 14, pp. 3059–3066, 2002.
“MIT credits Irish-based entrepreneur with co-coining term “cloud computing”,” 2013, http://www.siliconrepublic.com/cloud/item/24280-mit-credits-irish-based-ent.
[56]
“The Benefits Of Data Center Virtualization For Businesses, CloudTweaks,” 2013, http://www.cloudtweaks.com/2012/03/the-benefits-of-data-center-virtualization-for-businesses/.
[57]
“VMware Virtualization Software for Desktops, Servers & Virtual Machines for Public and Private Cloud Solutions,” 2013, http://www.vmware.com/.
[58]
Main Page—KVM, 2013, http://www.linux-kvm.org/page/Main_Page.
[59]
“Accelerating high-performance computing applications using parallel computing,” 2013, http://embedded-computing.com/articles/accelerating-using-parallel-computing/.
[60]
“Cloud 101: What the heck do IaaS, PaaS and SaaS companies do?” 2013, http://venturebeat.com/2011/11/14/cloud-iaas-paas-saas/.
[61]
Amazon Web Services, “Cloud Computing: Compute, Storage, Database,” 2013, http://aws.amazon.com/.
[62]
Microsoft Home Page, Devices and Services, 2013, http://www.microsoft.com/en-us/default.aspx.
V. Choudhary, “Software as a service: implications for investment in software development,” in Proceedings of the 40th Annual Hawaii International Conference on System Sciences (HICSS '07), January 2007.
[68]
G. Lawton, “Developing software online with platform-as-a-service technology,” Computer, vol. 41, no. 6, pp. 13–15, 2008.
[69]
J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[70]
T. Nguyen, W. Shi, and D. Ruden, “CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping,” BMC Research Notes, vol. 4, article 171, 2011.
[71]
M. C. Schatz, “CloudBurst: highly sensitive read mapping with MapReduce,” Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 2009.
[72]
L. Pireddu, S. Leo, and G. Zanetti, “Seal: a distributed short read mapping and duplicate removal tool,” Bioinformatics, vol. 27, no. 15, pp. 2159–2160, 2011.
[73]
“Blastreduce: high performance short read mapping with mapreduce,” 2013, http://www.cbcb.umd.edu/software/blastreduce.
[74]
B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, “Searching for SNPs with cloud computing,” Genome Biology, vol. 10, no. 11, article R134, 2009.
[75]
M. C. Schatz, D. Sommer, D. Kelley, and M. Pop, “De novo assembly of large genomes using cloud computing,” in Proceedings of the CSHL Biology of Genomes Conference, 2010.
[76]
Y.-J. Chang, C. C. Chen, C. L. Chen, and J. M. Ho, “A de novo next generation genomic sequence assembler based on string graph and mapreduce cloud computing framework,” BMC Genomics, vol. 13, supplement 7, p. S28, 2012.
[77]
B. Langmead, K. D. Hansen, and J. T. Leek, “Cloud-scale RNA-sequencing differential expression analysis with Myrna,” Genome Biology, vol. 11, no. 8, p. R83, 2010.
[78]
D. Hong, A. Rhie, S.-S. Park et al., “FX: an RNA-seq analysis tool on the cloud,” Bioinformatics, vol. 28, no. 5, pp. 721–723, 2012.
[79]
L. Jourdren, M. Bernard, M. A. Dillies, and S. Le Crom, “Eoulsan: a cloud computing-based framework facilitating high throughput sequencing analyses,” Bioinformatics, vol. 28, no. 11, pp. 1542–1543, 2012.
[80]
M. Niemenmaa, A. Kallio, A. Schumacher, P. Klemel?, E. Korpelainen, and K. Heljanko, “Hadoop-BAM: directly manipulating next generation sequencing data in the cloud,” Bioinformatics, vol. 28, no. 6, pp. 876–877, 2012.
[81]
B. D. O'Connor, B. Merriman, and S. F. Nelson, “SeqWare Query Engine: storing and searching sequence data in the cloud,” BMC Bioinformatics, vol. 11, supplement 12, article S2, 2010.
[82]
A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[83]
S. J. Matthews and T. L. Williams, “MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees,” BMC Bioinformatics, vol. 11, supplement 1, article S15, 2010.
[84]
M. E. Colosimo, M. W. Peterson, S. Mardis, and L. Hirschman, “Nephele: genotyping via complete composition vectors and MapReduce,” Source Code for Biology and Medicine, vol. 6, article 13, 2011.
[85]
P. D. Vouzis and N. V. Sahinidis, “GPU-BLAST: using graphics processors to accelerate protein sequence alignment,” Bioinformatics, vol. 27, no. 2, pp. 182–188, 2011.
[86]
C.-M. Liu, T. Wong, E. Wu et al., “SOAP3: ultra-fast GPU-based parallel alignment tool for short reads,” Bioinformatics, vol. 28, no. 6, pp. 878–879, 2012.
[87]
S. Lewis, A. Csordas, S. Killcoyne, et al., “Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework,” BMC Bioinformatics, vol. 13, article 324, 2012.
[88]
A. Matsunaga, M. Tsugawa, and J. Fortes, “CloudBLAST: combining MapReduce and virtualization on distributed resources for bioinformatics applications,” in Proceedings of the 4th IEEE International Conference on eScience (eScience '08), pp. 222–229, December 2008.
[89]
X. Feng, R. Grossman, and L. Stein, “PeakRanger: a cloud-enabled peak caller for ChIP-seq data,” BMC Bioinformatics, vol. 12, article 139, 2011.
[90]
L. Zhang, S. Gu, Y. Liu, B. Wang, and F. Azuaje, “Gene set analysis in the cloud,” Bioinformatics, vol. 28, no. 2, pp. 294–295, 2012.
[91]
S. Leo, F. Santoni, and G. Zanetti, “Biodoop: bioinformatics on hadoop,” in Proceedings of the 38th International Conference Parallel Processing Workshops (ICPPW '09), pp. 415–422, September 2009.
[92]
H. Huang, S. Tata, and R. J. Prill, “BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters,” Bioinformatics, vol. 29, no. 1, pp. 135–136, 2013.
[93]
D. R. Kelley, M. C. Schatz, and S. L. Salzberg, “Quake: quality-aware detection and correction of sequencing errors,” Genome Biology, vol. 11, no. 11, article R116, 2010.
[94]
Apache Hadoop, 2013, http://hadoop.apache.org/.
[95]
How Hadoop Makes Short Work Of Big Data—Forbes, 2013, http://www.forbes.com/sites/netapp/2012/09/24/hadoop-big-data/.
[96]
Public Data Sets on Amazon Web Services (AWS), 2013, http://aws.amazon.com/publicdatasets/.
[97]
P. M. Kasson, “Computational biology in the cloud: methods and new insights from computing at scale,” in Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, World Scientific, 2013.
[98]
S. Vijayakumar, A. Bhargavi, U. Praseeda, and S. A. Ahamed, “Optimizing sequence alignment in cloud using hadoop and MPP database,” in Proceedings of the 5th IEEE International Conference on Cloud Computing (CLOUD '12), 2012.
[99]
R. D. Sleator, “Proteins: form and function,” Bioeng Bugs, vol. 3, no. 2, pp. 80–85, 2012.