%0 Journal Article
%T FPGA Acceleration of Communication-Bound Streaming Applications: Architecture Modeling and a 3D Image Compositing Case Study
%A Tobias Schumacher
%A Tim Süß
%A Christian Plessl
%A Marco Platzner
%J International Journal of Reconfigurable Computing
%D 2011
%I Hindawi Publishing Corporation
%R 10.1155/2011/760954
%X Reconfigurable computers usually provide a limited number of different memory resources, such as host memory, external memory, and on-chip memory, with different capacities and communication characteristics. A key challenge for achieving high performance with reconfigurable accelerators is the efficient utilization of the available memory resources. Detailed knowledge of the memories' parameters is key for generating an optimized communication layout. In this paper, we discuss a benchmarking environment for generating such a characterization. The environment is built on IMORC, our architectural template and on-chip network for creating reconfigurable accelerators. We provide a characterization of the memory resources available on the XtremeData XD1000 reconfigurable computer. Based on this data, we present, as a case study, the implementation of a 3D image compositing accelerator that is able to double the frame rate of a parallel renderer.
1. Introduction
Reconfigurable accelerators achieve performance gains over CPUs by turning application hot spots into customized hardware cores and by providing customized memory architectures that deliver the required high data bandwidth. Typical reconfigurable platforms for high-performance computing come with a fixed memory architecture and offer little or no possibility to change the size and organization of the external memory on a per-application basis. A specific challenge is to find new methods for reducing the design effort for accelerators that can use the given memory layout in a flexible yet effective way.
To support reconfigurable accelerator design, we have created IMORC (Infrastructure for Performance Monitoring and Optimization of Reconfigurable Computers) [1, 2]. IMORC consists of an architectural template and an on-chip network. An application is split into an arbitrary number of cores that run at full speed in their own clock domains and communicate asynchronously via FIFO-buffered links. IMORC inserts bitwidth conversion modules into the links, which speeds up the accelerator design process and facilitates the reuse of developed processing cores. The IMORC infrastructure also includes memory controllers and host interfaces, which provide the cores with a unified and transparent way of accessing different kinds of memory, for example, on-chip memory, off-chip memory, or host memory. Related work on architectural templates like IMORC includes SIMPPL [3], which also connects different cores in a field-programmable gate array (FPGA) using asynchronous FIFOs. An example for a
%U http://www.hindawi.com/journals/ijrc/2011/760954/