We propose a fast data relay (FDR) mechanism to enhance existing coarse-grained reconfigurable architectures (CGRAs). FDR not only provides multicycle data transmission concurrently with computations but also converts resource-demanding inter-processing-element global data accesses into local data accesses to avoid communication congestion. We also propose supporting compiler techniques that efficiently utilize the FDR feature to achieve higher performance for a variety of applications. We compare our results on the FDR-based CGRA with two other works in this field: ADRES and RCP. Experimental results for various multimedia applications show that FDR combined with the new compiler delivers up to 29% and 21% higher performance than ADRES and RCP, respectively.

1. Introduction and Related Work

Much research has been done to evaluate the performance, power, and cost of reconfigurable architectures [1, 2]. Some use standard commercial FPGAs, while others contain processors coupled with reconfigurable coprocessors (e.g., GARP [3], Chimaera [4]). Meanwhile, coarse-grained reconfigurable architectures (CGRAs) have attracted a lot of attention from the research community [5]. CGRAs utilize an array of predefined processing elements (PEs) to provide computational power. Because the PEs can perform byte- or word-level computations efficiently, CGRAs can provide higher performance for data-intensive applications, such as video and signal processing. In addition, because CGRAs are coarse grained, they have smaller communication and configuration overheads than fine-grained field-programmable gate arrays (FPGAs).

Based on how the PEs are organized, existing CGRAs can generally be classified into linear array architectures and mesh-based architectures. In a linear array architecture, PEs are organized in one or several linear arrays. Representative works in this category are RaPiD [6] and PipeRench [7].
RaPiD can speed up highly regular, computation-intensive applications by deeply pipelining the application on a chain of RaPiD cells. PipeRench provides speedup for pipelined applications by utilizing PEs to form reconfigurable pipeline stages that are then interconnected with a crossbar. The linear array organization is highly efficient when the computations can be linearly pipelined. With the emergence of many 2D video applications, however, the linear array organization becomes less flexible and less efficient at supporting block-based applications [8]. Therefore, a number of mesh-based CGRAs have been proposed. Representative works in this
References
[1]
S. Hauck and A. DeHon, Eds., Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation (Systems on Silicon), Morgan Kaufmann, Boston, Mass, USA, 2007.
[2]
T. J. Todman, G. A. Constantinides, S. J. E. Wilton, O. Mencer, W. Luk, and P. Y. K. Cheung, “Reconfigurable computing: architectures and design methods,” IEE Proceedings—Computers and Digital Techniques, vol. 152, no. 2, article 193.
[3]
J. R. Hauser and J. Wawrzynek, “Garp: a MIPS processor with a reconfigurable coprocessor,” in Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 12–21, April 1997.
[4]
Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, “Chimaera: a high-performance architecture with a tightly-coupled reconfigurable functional unit,” in Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA '00), pp. 225–235, June 2000.
[5]
R. Hartenstein, “Coarse grain reconfigurable architecture (embedded tutorial),” in Proceedings of the 16th Asia South Pacific Design Automation Conference (ASP-DAC '01), pp. 564–570, 2001.
[6]
C. Ebeling, D. C. Cronquist, and P. Franklin, “RaPiD–reconfigurable pipelined datapath,” in Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers (FPL '96), 1996.
[7]
S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Matt, and R. R. Taylor, “PipeRench: a reconfigurable architecture and compiler,” Computer, vol. 33, no. 4, pp. 70–77, 2000.
[8]
H. Singh, M. H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho, “MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications,” IEEE Transactions on Computers, vol. 49, no. 5, pp. 465–481, 2000.
[9]
R. W. Hartenstein and R. Kress, “Datapath synthesis system for the reconfigurable datapath architecture,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '95), pp. 479–484, September 1995.
[10]
E. Mirsky and A. DeHon, “MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources,” in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '96), pp. 157–166, April 1996.
[11]
B. Mei, S. Vernalde, D. Verkest, and R. Lauwereins, “Design methodology for a tightly coupled VLIW/reconfigurable matrix architecture: a case study,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '04), pp. 1224–1229, February 2004.
[12]
O. Colavin and D. Rizzo, “A scalable wide-issue clustered VLIW with a reconfigurable interconnect,” in Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '03), pp. 148–158, November 2003.
[13]
M. B. Taylor, W. Lee, J. Miller et al., “Evaluation of the raw microprocessor: an exposed-wire-delay architecture for ILP and streams,” in Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA '04), pp. 2–13, June 2004.
[14]
S. Friedman, A. Carroll, B. Van Essen, B. Ylvisaker, C. Ebeling, and S. Hauck, “SPR: an architecture-adaptive CGRA mapping tool,” in Proceedings of the 7th ACM SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '09), pp. 191–200, February 2009.
[15]
H. Park, K. Fan, S. Mahlke, T. Oh, H. Kim, and H. S. Kim, “Edge-centric modulo scheduling for coarse-grained reconfigurable architectures,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), pp. 166–176, October 2008.
[16]
G. Lee, K. Choi, and N. D. Dutt, “Mapping multi-domain applications onto coarse-grained reconfigurable architectures,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 5, pp. 637–650, 2011.
[17]
T. Suzuki, H. Yamada, T. Yamagishi et al., “High-throughput, low-power software-defined radio using reconfigurable processors,” IEEE Micro, vol. 31, no. 6, pp. 19–28, 2011.
[18]
Z. Kwok and S. J. E. Wilton, “Register file architecture optimization in a coarse-grained reconfigurable architecture,” in Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '05), pp. 35–44, April 2005.
[19]
S. Cadambi and S. C. Goldstein, “Efficient place and route for pipeline reconfigurable architectures,” in Proceedings of the International Conference on Computer Design (ICCD '00), pp. 423–429, September 2000.
[20]
S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens, “Register organization for media processing,” in Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA '00), pp. 375–386, January 2000.
[21]
R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi, “Reducing the complexity of the register file in dynamic superscalar processors,” in Proceedings of the 34th Annual International Symposium on Microarchitecture (ACM/IEEE '01), pp. 237–248, December 2001.
[22]
B. Mei, F. J. Veredas, and B. Masschelein, “Mapping an H.264/AVC decoder onto the adres reconfigurable architecture,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '05), pp. 622–625, August 2005.
[23]
C. Lattner, “Introduction to the LLVM Compiler Infrastructure,” in Itanium Conference and Expo, April 2006.
[24]
J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, chapter 3, Morgan Kaufmann, Boston, Mass, USA, 4th edition, 2006.
[25]
G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994.
[26]
R. Nair, “A simple yet effective technique for global wiring,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 6, no. 2, pp. 165–172, 1987.