FPGA Acceleration of Communication-Bound Streaming Applications: Architecture Modeling and a 3D Image Compositing Case Study

DOI: 10.1155/2011/760954


Abstract:

Reconfigurable computers usually provide a limited number of different memory resources, such as host memory, external memory, and on-chip memory, with different capacities and communication characteristics. A key challenge in achieving high performance with reconfigurable accelerators is the efficient utilization of the available memory resources. Detailed knowledge of the memories' parameters is essential for generating an optimized communication layout. In this paper, we discuss a benchmarking environment for generating such a characterization. The environment is built on IMORC, our architectural template and on-chip network for creating reconfigurable accelerators. We provide a characterization of the memory resources available on the XtremeData XD1000 reconfigurable computer. Based on this data, we present as a case study the implementation of a 3D image compositing accelerator that is able to double the frame rate of a parallel renderer.

1. Introduction

Reconfigurable accelerators achieve performance gains over CPUs by turning application hot spots into customized hardware cores and by providing customized memory architectures that deliver the required high data bandwidth. Typical reconfigurable platforms for high-performance computing come with a fixed memory architecture that offers little or no possibility of changing the size and organization of the external memory on a per-application basis. A specific challenge is to find new methods for reducing the design effort for accelerators that can use the given memory layout in a flexible yet effective way. To support reconfigurable accelerator design, we have created IMORC (Infrastructure for Performance Monitoring and Optimization of Reconfigurable Computers) [1, 2]. IMORC consists of an architectural template and an on-chip network. An application is split into an arbitrary number of cores that run at full speed in their own clock domains and communicate asynchronously via FIFO-buffered links.
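As a rough software analogy of the FIFO-buffered links described above, the following Python sketch models a bounded FIFO between a producer core and a consumer core, together with a simple repacking step between a narrow and a wider port. It is only an illustration under stated assumptions: the names `FifoLink` and `pack_words`, the FIFO depth, and the least-significant-first packing order are hypothetical and are not taken from IMORC.

```python
from collections import deque


class FifoLink:
    """Bounded FIFO modeling a buffered link between two cores (hypothetical model)."""

    def __init__(self, depth):
        self.depth = depth
        self.buf = deque()

    def push(self, word):
        """Producer side: returns False (back-pressure) when the FIFO is full."""
        if len(self.buf) >= self.depth:
            return False
        self.buf.append(word)
        return True

    def pop(self):
        """Consumer side: returns None when the FIFO is empty."""
        return self.buf.popleft() if self.buf else None


def pack_words(words, in_bits, out_bits):
    """Pack narrow words into wider ones, first word in the least significant bits.

    The packing order is an assumption made for this sketch; trailing words
    that do not fill a complete wide word are dropped.
    """
    assert out_bits % in_bits == 0
    ratio = out_bits // in_bits
    mask = (1 << in_bits) - 1
    packed = []
    for i in range(0, len(words) - len(words) % ratio, ratio):
        wide = 0
        for j, w in enumerate(words[i:i + ratio]):
            wide |= (w & mask) << (j * in_bits)
        packed.append(wide)
    return packed


# Producer emits 8-bit words; the consumer reads 32-bit words from the link.
link = FifoLink(depth=4)
narrow = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
for wide in pack_words(narrow, in_bits=8, out_bits=32):
    assert link.push(wide)

received = []
while (w := link.pop()) is not None:
    received.append(w)
# received == [0x44332211, 0x88776655]
```

In hardware, the two sides of such a link would run in separate clock domains; the sketch captures only the buffering and back-pressure behavior, not the clock-domain crossing itself.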
IMORC inserts bit-width conversion modules into the links, which speeds up the accelerator design process and facilitates the reuse of developed processing cores. The IMORC infrastructure also includes memory controllers and host interfaces that give the cores a unified and transparent way of accessing different kinds of memory, for example, on-chip memory, off-chip memory, or host memory. Related work on architectural templates like IMORC includes SIMPPL [3], which also connects different cores in a field-programmable gate array (FPGA) using asynchronous FIFOs. An example for a

References

[1]  T. Schumacher, C. Plessl, and M. Platzner, “IMORC: application mapping, monitoring and optimization for high-performance reconfigurable computing,” in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '09), pp. 275–278, IEEE Computer Society, 2009.
[2]  T. Schumacher, C. Plessl, and M. Platzner, “An accelerator for k-th nearest neighbor thinning based on the IMORC infrastructure,” in Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL '09), pp. 338–344, IEEE, September 2009.
[3]  L. Shannon and P. Chow, “Simplifying the integration of processing elements in computing systems using a programmable controller,” in Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '05), vol. 2005, pp. 63–72, IEEE, 2005.
[4]  C. Steffen, “Parametrization of algorithms and FPGA accelerators to predict performance,” in Proceedings of the Reconfigurable System Summer Institute (RSSI '07), pp. 17–20, 2007.
[5]  B. Holland, K. Nagarajan, C. Conger, A. Jacobs, and A. D. George, “RAT: a methodology for predicting performance in application design migration to FPGAs,” in Proceedings of the High-Performance Reconfigurable Computing Technologies and Applications Workshop (HPRCTA '07), 2007.
[6]  S. Koehler, J. Curreri, and A. D. George, “Performance analysis challenges and framework for high-performance reconfigurable computing,” Parallel Computing, vol. 34, no. 4-5, pp. 217–230, 2008.
[7]  M. C. Smith and G. D. Peterson, “Analytical modeling for high performance reconfigurable computers,” in Proceedings of the International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS '02), July 2002.
[8]  M. C. Smith and G. D. Peterson, “Parallel application performance on shared high performance reconfigurable computing resources,” Performance Evaluation, vol. 60, no. 1–4, pp. 107–125, 2005.
[9]  T. Schumacher, T. Süß, C. Plessl, and M. Platzner, “Communication performance characterization for reconfigurable accelerator design on the XD1000,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '09), pp. 119–124, IEEE Computer Society, Los Alamitos, Calif, USA, 2009.
[10]  D. Slogsnat, A. Giese, and U. Brüning, “A versatile, low latency Hypertransport core,” in Proceedings of the 15th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '07), pp. 45–52, ACM, February 2007.
[11]  “RAMspeed”, http://www.alasir.com/software/ramspeed/.
[12]  “STREAM Benchmark”, http://www.cs.virginia.edu/stream/.
[13]  S. Molnar, M. Cox, D. Ellsworth, and H. Fuchs, “A sorting classification of parallel rendering,” Tech. Rep. TR94-023, 8, 1994.
[14]  G. Stoll, M. Eldridge, D. Patterson et al., “Lightning-2: a high-performance display subsystem for PC clusters,” in Proceedings of the Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '01), pp. 141–148, August 2001.
[15]  S. Dominick and R. Yang, “Anywhere pixel router,” in Proceedings of the ACM/IEEE 5th International Workshop on Projector Camera Systems (PROCAMS '08), pp. 1–2, ACM, August 2008.
[16]  S. Muraki, M. Ogata, K.-L. Ma et al., “Next-generation visual supercomputing using PC clusters with volume graphics hardware devices,” in Proceedings of the Conference on Supercomputing, p. 51, ACM, New York, NY, USA, 2001.
[17]  L. Moll, A. Heirich, and M. Shand, “Sepia: scalable 3D compositing using PCI pamette,” in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '99), pp. 146–157, 1999.
[18]  S. Lombeyda, L. Moll, M. Shand, D. Breen, and A. Heirich, “Scalable interactive volume rendering using off-the-shelf components,” in Proceedings of the Symposium on Parallel and Large-Data Visualization and Graphics, pp. 115–121, IEEE, Piscataway, NJ, USA, 2001.
[19]  S. Eilemann and R. Pajarola, “Direct send compositing for parallel sort-last rendering,” in Proceedings of the ACM SIGGRAPH ASIA Courses, December 2008.
[20]  K. L. Ma, J. S. Painter, C. D. Hansen, and M. F. Krogh, “Parallel volume rendering using binary-swap compositing,” IEEE Computer Graphics and Applications, vol. 14, no. 4, pp. 59–68, 1994.
[21]  “OpenMPI homepage”, http://www.open-mpi.org/.
[22]  T. Schumacher, E. Lübbers, P. Kaufmann, and M. Platzner, “Accelerating the cube cut problem with an FPGA-augmented compute cluster,” in Proceedings of the ParaFPGA Symposium International Conference on Parallel Computing (ParCo '07), vol. 38, pp. 749–756, John von Neumann Institute for Computing, Jülich, Germany, 2007.
[23]  T. Schumacher, Performance modeling and analysis in high-performance reconfigurable computing, Ph.D. Thesis, University of Paderborn, 2011.
