International Journal of Reconfigurable Computing 2011

The Potential for a GPU-Like Overlay Architecture for FPGAs

DOI: 10.1155/2011/514581

Jeffrey Kingyens, J. Gregory Steffan

Abstract:

We propose a soft processor programming model and architecture inspired by graphics processing units (GPUs) that are well-matched to the strengths of FPGAs, namely, highly parallel and pipelinable computation. In particular, our soft processor architecture exploits multithreading, vector operations, and predication to supply a floating-point pipeline of 64 stages via hardware support for up to 256 concurrent thread contexts. The key new contributions of our architecture are mechanisms for managing threads and register files that maximize data-level and instruction-level parallelism while overcoming the challenges of port limitations of FPGA block memories as well as memory and pipeline latency. Through simulation of a system that (i) is programmable via NVIDIA's high-level Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and (iii) is realizable on an XtremeData XD1000 FPGA-based accelerator system, we demonstrate the potential for such a system to achieve 100% utilization of a deeply pipelined floating-point datapath.

1. Introduction

As FPGAs become increasingly dense and powerful, with high-speed I/Os, hard multipliers, and plentiful memory blocks, they have become more desirable platforms for computing. Recently, interest has been building in using FPGAs as accelerators for high-performance computing, leading to commercial products such as the SGI RASC, which integrates FPGAs into a blade server platform, and modules from XtremeData and Nallatech, which offer FPGA accelerators that can be installed alongside a conventional CPU in a standard dual-socket motherboard. The challenge for such systems is to provide a programming model that is easily accessible to the programmers in the scientific, financial, and other data-driven arenas who will use them.
Developing an accelerator design in a hardware description language such as Verilog is difficult, requiring an expert hardware designer to perform all of the implementation, testing, and debugging required for developing real hardware. Behavioral synthesis techniques, which allow a programmer to write code in a high-level language such as C that is then automatically translated into custom hardware circuits, have long-term promise [1–3] but currently have many limitations. What is needed is a high-level programming model specifically tailored to making the creation of custom FPGA-based accelerators easy. In contrast with the approaches of custom hardware and behavioral synthesis, a more familiar model is to use a standard high-level language and environment to program a processor, or in this case an

References

[1]	J. Koo, D. Fernandez, A. Haddad, and W. Gross, “Evaluation of a high-level-language methodology for high-performance reconfigurable computers,” in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '07), pp. 30–35, July 2007.
[2]	D. Lau, O. Pritchard, and P. Molson, “Automated generation of hardware accelerators with direct memory access from ANSI/ISO standard C functions,” in Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '06), pp. 45–54, April 2006.
[3]	J. L. Tripp, K. D. Peterson, C. Ahrens, J. D. Poznanovic, and M. B. Gokhale, “Trident: an FPGA compiler framework for floating-point algorithms,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '05), pp. 317–322, August 2005.
[4]	J. Hensley, “AMD CTM overview,” in Proceedings of the International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '07), ACM, August 2007.
[5]	B. Fort, D. Capalija, Z. G. Vranesic, and S. D. Brown, “A multithreaded soft processor for SoPC area reduction,” in Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '06), pp. 131–140, April 2006.
[6]	M. Labrecque and J. G. Steffan, “Improving pipelined soft processors with multithreading,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '07), pp. 210–215, August 2007.
[7]	R. Moussali, N. Ghanem, and M. A. R. Saghir, “Supporting multithreading in configurable soft processor cores,” in Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '07), pp. 155–159, October 2007.
[8]	P. Yiannacouras, J. G. Steffan, and J. Rose, “VESPA: portable, scalable, and flexible FPGA-based vector processors,” in Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '08), 2008.
[9]	J. Yu, G. Lemieux, and C. Eagleston, “Vector processing as a soft-core CPU accelerator,” in Proceedings of the 16th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '08), pp. 222–231, February 2008.
[10]	M. Labrecque, P. Yiannacouras, and J. G. Steffan, “Scaling soft processor systems,” in Proceedings of the 16th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '08), pp. 195–205, April 2008.
[11]	W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard, “Cg: a system for programming graphics hardware in a C-like language,” in Proceedings of the International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '03), pp. 896–907, ACM, New York, NY, USA, 2003.
[12]	J. Kingyens and J. G. Steffan, “A GPU-inspired soft processor for high-throughput acceleration,” in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW '10), April 2010.
[13]	“Developing FPGA coprocessors for performance-accelerated spacecraft image processing,” Xcell Journal Second Quarter, pp. 22–26, 2008.
[14]	O. Mencer, “ASC: a stream compiler for computing with FPGAs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 9, Article ID 1673737, pp. 1603–1617, 2006.
[15]	I. Page, “Closing the gap between hardware and software: hardware-software cosynthesis at Oxford,” in Proceedings of the IEE Colloquium on Hardware-Software Cosynthesis for Reconfigurable Systems, pp. 201–211, February 1996, Digest no: 1996/036.
[16]	P. Yiannacouras, J. Rose, and J. Gregory Steffan, “The microarchitecture of FPGA-based soft processors,” in Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '05), pp. 202–212, New York, NY, USA, 2005.
[17]	J. Yu, G. Lemieux, and C. Eagleston, “Vector processing as a soft-core CPU accelerator,” in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '08), pp. 222–231, ACM, New York, NY, USA, 2008.
[18]	R. Dimond, O. Mencer, and W. Luk, “Application-specific customisation of multi-threaded soft processors,” IEE Proceedings: Computers and Digital Techniques, vol. 153, no. 3, pp. 173–180, 2006.
[19]	P. James-Roxby, P. Schumacher, and C. Ross, “A single program multiple data parallel processing platform for FPGAs,” in Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '04), pp. 302–303, April 2004.
[20]	A. K. Jones, R. Hoare, I. S. Kourtev et al., “A 64-way VLIW/SIMD FPGA architecture and design flow,” in Proceedings of the 11th IEEE International Conference on Electronics, Circuits and Systems (ICECS '04), pp. 499–502, December 2004.
[21]	C. E. LaForest and J. G. Steffan, “Efficient multi-ported memories for FPGAs,” in Proceedings of the 18th ACM SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '10), pp. 41–50, February 2010.
[22]	M. A. R. Saghir, M. El-Majzoub, and P. Akl, “Datapath and ISA customization for soft VLIW processors,” in Proceedings of the IEEE International Conference on Reconfigurable Computing and FPGA (ReConFig '06), pp. 1–10, September 2006.
[23]	M. Peercy, M. Segal, and D. Gerstmann, “A performance-oriented data parallel virtual machine for GPUs,” in Proceedings of the International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '06), p. 184, ACM, New York, NY, USA, 2006.
[24]	W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, “Dynamic warp formation and scheduling for efficient GPU control flow,” in Proceedings of the 40th Annual International Symposium on Microarchitecture (MICRO '07), pp. 407–418, IEEE Computer Society, Washington, DC, USA, 2007.
[25]	D. Slogsnat, A. Giese, and U. Brüning, “A versatile, low latency HyperTransport core,” in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '07), pp. 45–52, ACM, New York, NY, USA, 2007.
[26]	B. Holden, Latency Comparison between HyperTransport and PCI-Express in Communications Systems, World Wide Web Electronic Publication, 2006.
[27]	K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the efficiency of GPU algorithms for matrix-matrix multiplication,” in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp. 133–137, ACM, New York, NY, USA, 2004.
