Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform. 1. Introduction Protein structure alignment is a useful strategy for structural biology. Most of the alignment methods rely on structure comparison to identify structural, evolutionary, and functional relationships between proteins [1]. In general, these methods align proteins based on structural similarity. A structural alignment can identify the evolutionary equivalent residues when the aligned proteins share a common ancestor. Unlike sequence alignment tools, which focus on equivalent residues, structural alignment methods focus on conserved protein structure. Therefore, structural alignments of remote homologous proteins are more reliable than sequence alignments. Structural alignment identifies functional mechanisms by comparing functionally related proteins and can also annotate the function of proteins whose structures have been detected. Several protein structural alignment methods [2–8] compare protein structures by structural similarity based on secondary structure elements, as well as intra- and intermolecular atomic distances. The basic idea of structure alignment is to identify the secondary structural elements, cluster these elements into groups, and score the best substructure alignment. The Vector Alignment Search Tool (VAST) [2] compares protein structures according to the continuous distribution of domains in the fold space. VAST has been used to compare all known Protein Data Bank (PDB) domains to each other. The alignment results are presented in NCBI’s Molecular Modeling Database [9].
References
[1]
M. Gerstein, R. Jansen, T. Johnson, J. Tsai, and W. Krebs, “Motions in a database framework: from structure to sequence,” Rigidity Theory and Applications, pp. 401–442, 1999.
[2]
J. F. Gibrat, T. Madej, and S. H. Bryant, “Surprising similarities in structure comparison,” Current Opinion in Structural Biology, vol. 6, no. 3, pp. 377–385, 1996.
[3]
L. Holm and C. Sander, “Touring protein fold space with Dali/FSSP,” Nucleic Acids Research, vol. 26, no. 1, pp. 316–319, 1998.
[4]
C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton, “CATH—a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1108, 1997.
[5]
I. N. Shindyalov and P. E. Bourne, “Protein structure alignment by incremental combinatorial extension (CE) of the optimal path,” Protein Engineering, vol. 11, no. 9, pp. 739–747, 1998.
[6]
A. Guerler and E. W. Knapp, “Novel protein folds and their nonsequential structural analogs,” Protein Science, vol. 17, no. 8, pp. 1374–1382, 2008.
[7]
W. R. Taylor, T. P. Flores, and C. A. Orengo, “Multiple protein structure alignment,” Protein Science, vol. 3, no. 10, pp. 1858–1870, 1994.
[8]
Y. Yang, J. Zhan, H. Zhao, and Y. Zhou, “A new size-independent score for pairwise protein structure alignment and its application to structure classification and nucleic-acid binding prediction,” Proteins, vol. 80, pp. 2080–2088, 2012.
“Hadoop—Apache Software Foundation project,” http://hadoop.apache.org/.
[11]
R. C. Taylor, “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics,” BMC Bioinformatics, vol. 11, no. 12, article S1, 2010.
[12]
M. C. Schatz, “CloudBurst: Highly sensitive read mapping with MapReduce,” Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 2009.
[13]
G. Sudha Sadasivam and G. Baktavatchalam, “A novel approach to multiple sequence alignment using hadoop data grids,” International Journal of Bioinformatics Research and Applications, vol. 6, no. 5, pp. 472–483, 2010.
[14]
B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, “Searching for SNPs with cloud computing,” Genome Biology, vol. 10, no. 11, article R134, 2009.
[15]
L. Euler, “Formulae generales pro translatione quacunque corporum rigidorum,” Novi Commentarii academiae scientiarum Petropolitanae, vol. 20, pp. 189–207, 1775.
[16]
A. Grap, A Treatise on Gyrostatics and Rotational Motion, MacMillan, London, UK, 1918.
[17]
J. Munkres, “Algorithms for the assignment and transportation problems,” Journal of the Society for Industrial and Applied Mathematics, vol. 5, pp. 32–38, 1957.
[18]
F. Bourgeois and J. C. Lassalle, “Extension of the Munkres algorithm for the assignment problems to rectangular matrices,” Communications of the ACM, vol. 14, no. 12, pp. 802–804, 1971.