OALib Journal
ISSN: 2333-9721
Exploring Many-Core Design Templates for FPGAs and ASICs

DOI: 10.1155/2012/439141


Abstract:

We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture. The key benefits of this approach are that it (i) allows programmers to express parallelism through an API defined in a high-level programming language, (ii) supports coarse-grained multithreading and fine-grained threading while permitting bit-level resource control, and (iii) reduces the effort required to repurpose the system for different algorithms or different applications. We compare template-driven design to both full-custom and programmable approaches by studying implementations of a compute-bound data-parallel Bayesian graph inference algorithm across several candidate platforms. Specifically, we examine a range of template-based implementations on both FPGA and ASIC platforms and compare each against full custom designs. Throughout this study, we use a general-purpose graphics processing unit (GPGPU) implementation as a performance and area baseline. We show that our approach, similar in productivity to programmable approaches such as GPGPU applications, yields implementations with performance approaching that of full-custom designs on both FPGA and ASIC platforms.

1. Introduction

Direct hardware implementations, using platforms such as FPGAs and ASICs, possess a huge potential for exploiting application-specific parallelism and performing efficient computation. As a result, the overall performance of custom hardware-based implementations is often higher than that of software-based ones [1, 2].
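The per-application customization mentioned above can be pictured as a small set of high-level template parameters. The following Python sketch is purely illustrative: the parameter names (`num_pes`, `interconnect`, `pe_datapath_width`, `hw_threads_per_pe`) are hypothetical stand-ins and are not taken from the paper's actual tool flow.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ManyCoreTemplate:
    """Hypothetical high-level knobs for customizing a many-core template."""
    num_pes: int            # number of processing elements
    interconnect: str       # topology, e.g. "ring", "mesh", "crossbar"
    pe_datapath_width: int  # bit-level resource control per PE
    hw_threads_per_pe: int  # coarse-grained multithreading depth

# An FPGA instance might trade PE count against datapath width;
# an ASIC instance of the same template could choose differently.
fpga_cfg = ManyCoreTemplate(num_pes=16, interconnect="ring",
                            pe_datapath_width=32, hw_threads_per_pe=4)
```

The point of such a parameter set is that retargeting the design to a new algorithm changes only these values, not the underlying microarchitecture description.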
To attain bare-metal performance, however, programmers must employ hardware design principles such as clock management, state machines, pipelining, and device-specific memory management: all concepts well outside the expertise of application-oriented software developers. These observations raise a natural question: does there exist a more productive abstraction for high-performance hardware design? Based on modern programming disciplines, a viable approach would (1) allow programmers to express parallelism through an API defined in a high-level programming language, (2) support coarse-grained multithreading and fine-grained threading while permitting bit-level resource control, and (3) reduce the effort required to repurpose the implemented hardware platform for different algorithms or applications.
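Requirement (1), expressing parallelism through a high-level API, is exactly what OpenCL's work-item model provides. As a rough illustration only, the Python sketch below emulates how a kernel body is applied across a global index space; in real OpenCL the body would be written in OpenCL C and dispatched by the runtime (e.g. via `clEnqueueNDRangeKernel`), with work-items executing in parallel rather than in a loop.

```python
def vector_add_kernel(gid, a, b, out):
    """Kernel body: each work-item handles one element, selected by its
    global id (the analogue of OpenCL's get_global_id(0))."""
    out[gid] = a[gid] + b[gid]

def enqueue_nd_range(kernel, global_size, *args):
    """Sequential stand-in for the runtime launching one work-item per index."""
    for gid in range(global_size):
        kernel(gid, *args)

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
out = [0] * 4
enqueue_nd_range(vector_add_kernel, 4, a, b, out)
# out is now [11, 22, 33, 44]
```

Because the kernel is written against an index space rather than a specific machine, the same source can target a GPGPU, the many-core template on an FPGA, or an ASIC instance of the template.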

References

[1]  M. Lin, I. Lebedev, and J. Wawrzynek, “High-throughput Bayesian computing machine with reconfigurable hardware,” in Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '10), pp. 73–82, ACM, Monterey, California, USA, 2010.
[2]  M. Lin, I. Lebedev, and J. Wawrzynek, “OpenRCL: from sea-of-gates to sea-of-cores,” in Proceedings of the 20th IEEE International Conference on Field Programmable Logic and Applications, Milano, Italy, 2010.
[3]  Wikipedia, “C-to-hdl,” November 2009, http://en.wikipedia.org/wiki/C_to_HDL/.
[4]  M. Gokhale and J. Stone, “NAPA C: compiling for a hybrid RISC/FPGA architecture,” in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '98), Napa, Calif, USA, 1998.
[5]  T. J. Callahan, J. R. Hauser, and J. Wawrzynek, “Garp architecture and C compiler,” Computer, vol. 33, no. 4, pp. 62–69, 2000.
[6]  M. Budiu, G. Venkataramani, T. Chelcea, and S. C. Goldstein, “Spatial computation,” in Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XI '04), pp. 14–26, New York, NY, USA, October 2004.
[7]  J. Wawrzynek, D. Patterson, M. Oskin et al., “RAMP: research accelerator for multiple processors,” IEEE Micro, vol. 27, no. 2, pp. 46–57, 2007.
[8]  A. Papakonstantinou, K. Gururaj, J. A. Stratton, D. Chen, J. Cong, and W.-M. W. Hwu, “FCUDA: enabling efficient compilation of CUDA kernels onto FPGAs,” in Proceedings of the 7th IEEE Symposium on Application Specific Processors (SASP '09), San Francisco, Calif, USA, 2009.
[9]  M. Owaida, N. Bellas, K. Daloukas, and C. D. Antonopoulos, “Synthesis of platform architectures from OpenCL programs,” in Proceedings of the 19th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM '11), Salt Lake City, Utah, USA, 2011.
[10]  J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[11]  D. Heckerman, D. Geiger, and D. M. Chickering, “Learning Bayesian networks: the combination of knowledge and statistical data,” Machine Learning, vol. 20, no. 3, pp. 197–243, 1995.
[12]  J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, Calif, USA, 1988.
[13]  C. Fletcher, I. Lebedev, N. Asadi, D. Burke, and J. Wawrzynek, “Bridging the GPGPU-FPGA efficiency gap,” in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '11), pp. 119–122, New York, NY, USA, 2011.
[14]  N. Bani Asadi, C. W. Fletcher, G. Gibeling, et al., “ParaLearn: a massively parallel, scalable system for learning interaction networks on FPGAs,” in Proceedings of the 24th ACM International Conference on Supercomputing, pp. 83–94, ACM, Ibaraki, Japan, 2010.
[15]  D. M. Chickering, “Learning Bayesian Networks is NP-Complete,” in Learning from Data: Artificial Intelligence and Statistics V, pp. 121–130, Springer, New York, NY, USA, 1996.
[16]  B. Ellis and W. H. Wong, “Learning causal Bayesian network structures from experimental data,” Journal of the American Statistical Association, vol. 103, no. 482, pp. 778–789, 2008.
[17]  M. Teyssier and D. Koller, “Ordering-based search: a simple and effective algorithm for learning Bayesian networks,” in Proceedings of the 21st Conference on Uncertainty in AI (UAI '05), pp. 584–590, Edinburgh, UK, July 2005.
[18]  N. Friedman and D. Koller, “Being Bayesian about network structure,” in Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 201–210, Morgan Kaufmann, San Francisco, Calif, USA, 2000.
[19]  Khronos OpenCL Working Group, The OpenCL Specification, version 1.0.29, December 2008, http://khronos.org/registry/cl/specs/opencl-1.0.29.pdf.
[20]  M. Lin, I. Lebedev, and J. Wawrzynek, “OpenRCL: low-power high-performance computing with reconfigurable devices,” in Proceedings of the 18th International Symposium on Field Programmable Gate Array, 2010.
[21]  NVIDIA OpenCL Best Practices Guide, 2009, http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf.
[22]  C. Lattner and V. Adve, “LLVM: a compilation framework for lifelong program analysis & transformation,” in Proceedings of the International Symposium on Code Generation and Optimization (CGO '04), pp. 75–86, Palo Alto, Calif, USA, March 2004.
[23]  G. Gibeling, et al., “GateLib: a library for hardware and software research,” Tech. Rep., 2010.
[24]  J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, chapter 5, Prentice Hall, New York, NY, USA, 2nd edition, 2003.
