This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high performance efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM’s Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on performance per watt criterion and perform better than all other platforms on performance per dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both performance per watt and performance per dollar criteria. In general, in order to outperform other technologies on performance per dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs. 1. Introduction Since it was first announced in 1965, Moore’s law has stood up the test of time, providing exponential increases in computing power for science and engineering problems over time. However, while this law was largely followed through increases in transistor integration levels and clock frequencies, this is no longer possible as power consumption and heat dissipation are becoming major hurdles in the face of further clock frequency increases, the so-called frequency or power wall problem. In order to keep Moore’s law going, general-purpose processor manufacturers, for example, Intel and AMD, are now relying on multicore chip technology in which multiple cores run simultaneously on the same chip at capped clock frequencies to limit power consumption. While this has the potential to provide considerable speed-up for science and engineering applications, it is also creating a semantic gap between applications, traditionally written in sequential code, and hardware, as multicore technologies need to be programmed in parallel to take advantage of their performance potential. This problem is however also opening a window of opportunity for hitherto niche parallel computer technologies such as Field Programmable Gate Arrays (FPGAs) and
References
[1]
R. Durbin, S. Eddy, S. Krogh, and G. Michison, Biological Sequence Analysis: Probabilistic Models for Proteins and Nucleic Acids, Cambridge University Press, 1998.
[2]
S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
[3]
T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
[4]
K. Benkrid, Y. Liu, and A. Benkrid, “A highly parameterized and efficient FPGA-Based skeleton for pairwise biological sequence alignment,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 4, pp. 561–570, 2009.
[5]
T. Oliver, B. Schmidt, and D. Maskell, “Hyper customized processors for bio-sequence database scanning on FPGAs,” in Proceedings of the ACM/SIGDA 13th ACM International Symposium on Field Programmable Gate Arrays (FPGA '05), pp. 229–237, February 2005.
[6]
G. A. Harrison, J. M. Tanner, D. R. Pilbeam, and P. T. Baker, Human Biology: An Introduction to Human Evolution, Variation, Growth, and Adaptability, Oxford Science, 1988.
[7]
HP Proliant DL145 Server Series, Hewlett-Packard, 2007, http://www.hp.com/.
[8]
Celoxica, The RCHTX FPGA Acceleration Card Data Sheets, Celoxica, 2007, http://www.hypertransport.org/docs/tech/rchtx_datasheet_screen.pdf.
[9]
Xilinx Corporation, Virtex-4 Family Data Sheets, 2007, http://www.xilinx.com/support/documentation/virtex-4_data_sheets.htm.
F. Blagojevic, D. S. Nikolopous, A. Stamatkis, and C. D. Antonopoulos, “Dynamic multigrain parallelization on the cell broadband engine,” in Proceedings of the ACM SIGPLAN Principles and Practice of Parallel Computing (PPoPP '07), San Jose, Calif, USA, March 2007.
[15]
J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, “Introduction to the cell multiprocessor,” IBM Journal of Research and Development, vol. 49, no. 4-5, pp. 589–604, 2005.
[16]
C. Ling, K. Benkrid, and T. Hamada, “A parameterisable and scalable smith-Waterman algorithm implementation on CUDA-compatible GPUs,” in Proceedings of the IEEE 7th Symposium on Application Specific Processors (SASP '09), pp. 94–100, July 2009.
[17]
Y. Munekawa, F. Ino, and K. Hagihara, Design and Implementation of the Smith-Waterman Algorithm on the CUDA-Compatible GPU, 2008.
[18]
A. Bairoch and R. Apweiler, The SWISS-PROT protein knowledgebase and its supplement TrEMBL, Nucleic Acid Research, Release 56.3, October 2008.
[19]
Y. Song, G. M. Striemer, and A. Akoglu, “Performance analysis of IBM Cell Broadband Engine on sequence alignment,” in Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS '09), pp. 439–446, August 2009.
[20]
W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence comparison,” Proceedings of the National Academy of Sciences of the United States of America, vol. 85, no. 8, pp. 2444–2448, 1988.
[21]
M. Farrar, “Striped Smith-Waterman speeds database searches six times over other SIMD implementations,” Bioinformatics, vol. 23, no. 2, pp. 156–161, 2007.
[22]
Y. Liu, K. Benkrid, A. Benkrid, and S. Kasap, “An FPGA-based web server for high performance biological sequence alignment,” in Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS '09), pp. 361–368, August 2009.
[23]
K. Dohi, K. Benkrid, C. Ling, T. Hamada, and Y. Shibata, “Highly efficient mapping of the Smith-Waterman algorithm on CUDA-compatible GPUs,” in Proceedings of the 21st IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP '10), pp. 29–36, July 2010.