NCOR: An FPGA-Friendly Nonblocking Data Cache for Soft Processors with Runahead Execution

DOI: 10.1155/2012/915178

Abstract: Soft processors often use data caches to reduce the gap between processor and main memory speeds. To achieve high efficiency, these caches are usually simple and blocking. Blocking caches are not appropriate for processor designs such as Runahead and out-of-order execution. These designs require non-blocking caches to tolerate main memory latencies and to extract memory-level parallelism. However, conventional non-blocking cache designs are expensive and slow on FPGAs because they use content-addressable memories (CAMs). This work proposes NCOR, an FPGA-friendly non-blocking cache that exploits key properties of Runahead execution. NCOR does not require CAMs and instead relies on smart cache controllers. A 4 KB NCOR operates at 329 MHz on Stratix III FPGAs and uses only 270 logic elements. A 32 KB NCOR operates at 278 MHz and uses 269 logic elements.

1. Introduction

Embedded system applications increasingly use soft processors implemented on reconfigurable logic. Like other application classes, embedded applications evolve over time, and their computational needs and structure change. If experience with other application classes is any indication, embedded applications will come to incorporate algorithms with unstructured instruction-level parallelism. Existing soft processors use in-order organizations. These organizations map well onto reconfigurable logic and offer adequate performance for most existing embedded applications. Previous work has shown that, for programs with unstructured instruction-level parallelism, a 1-way out-of-order (OoO) processor can potentially outperform a 2- or even 4-way superscalar processor in a reconfigurable logic environment [1].
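The abstract notes two things about conventional non-blocking caches. They extract memory-level parallelism by keeping several main memory requests in flight, and they depend on CAMs. The toy software model below illustrates both points under stated assumptions. It is not NCOR's design, which avoids CAMs, and it is not a hardware description.

The model uses a classic miss status holding register (MSHR) file. Every new miss must be compared against all outstanding entries, which in hardware means a parallel associative search (a CAM). The class and function names, the 32-byte block size and the latency figures are hypothetical. The timing model is idealized: it ignores issue overhead and memory bandwidth limits.

```python
from dataclasses import dataclass, field


@dataclass
class MSHRFile:
    """Toy MSHR file for a conventional non-blocking cache (hypothetical model).

    In hardware, lookup() compares the missing block address against every
    valid entry at once, which is why such designs need a CAM.
    """
    num_entries: int
    block_bytes: int = 32                        # assumed cache block size
    entries: list = field(default_factory=list)  # outstanding block addresses

    def block(self, addr: int) -> int:
        return addr // self.block_bytes

    def lookup(self, addr: int) -> bool:
        # Associative search over all outstanding misses (the CAM in hardware).
        blk = self.block(addr)
        return any(e == blk for e in self.entries)

    def allocate(self, addr: int) -> str:
        if self.lookup(addr):
            return "merged"   # secondary miss to a block already in flight
        if len(self.entries) == self.num_entries:
            return "stall"    # MSHR file full: the processor must wait
        self.entries.append(self.block(addr))
        return "issued"       # new main memory request in flight

    def complete(self, addr: int) -> None:
        # Main memory returned the block, so its entry is freed.
        self.entries.remove(self.block(addr))


def stall_cycles(num_misses: int, latency: int, max_outstanding: int) -> int:
    """Idealized wait time for independent misses issued back to back.

    Misses overlap in groups of max_outstanding; max_outstanding=1 models a
    blocking cache that handles one miss at a time.
    """
    batches = -(-num_misses // max_outstanding)  # ceiling division
    return batches * latency


if __name__ == "__main__":
    # Four independent misses with an assumed 200-cycle memory latency.
    print(stall_cycles(4, 200, 1))  # blocking cache
    print(stall_cycles(4, 200, 4))  # non-blocking cache with 4 MSHRs
```

In this idealized model, the blocking configuration waits 800 cycles and the four-entry non-blocking configuration waits 200. That gap is the memory-level parallelism a non-blocking cache exposes. The `any(...)` comparison in `lookup` is the operation that becomes a CAM in a conventional hardware implementation.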
Conventional OoO processor designs are tuned for custom logic implementations. They rely heavily on content-addressable memories, multiported register files, and wide datapaths with multiple sources and destinations. These structures are inefficient when implemented on an FPGA fabric. It remains an open question whether an FPGA-friendly soft core can offer the benefits of OoO execution without the complexity and inefficiency of conventional OoO structures. A lower-complexity alternative to OoO architectures is Runahead Execution, or simply Runahead, which offers most of the benefits of OoO execution [2]. Runahead relies on the observation that much of the performance benefit of OoO execution often comes from allowing multiple outstanding main memory requests. Runahead extends a conventional