Complexity in processor microarchitecture, and the related issues of power density, hot spots, and wire delay, are seen as major concerns for design migration into the low-nanometer technologies of the future. This paper evaluates the hardware cost of an alternative to register-file organization, the superscalar stack issue array (SSIA). We believe this is the first such reported study using discrete stack elements. Several possible implementations are evaluated, using a 90 nm standard cell library as a reference model, yielding delay data and FO4 metrics. The evaluation, including reference to ASIC layout, RC extraction, and timing simulation, suggests a 4-wide issue rate of at least four Giga-ops/sec at 90 nm and opportunities for twofold future improvement by using more advanced design approaches.
1. Introduction
Current trends in semiconductor technology, and in particular the International Technology Roadmap for Semiconductors [1], suggest that microarchitecture will face significant challenges at the VLSI level. These include increasing power density [2], progressively severe thermal hot spots in increasingly complex designs [3], the impact of growing static power [4], and the problem of wire versus gate-delay and power scaling [5, 6]. Such problems are often most acutely exposed in key mainstream processor components such as the cache and register-related logic, including reorder buffers, rename logic, and the register file itself. Any alternative to the traditional register-based computing paradigm can therefore open up new approaches to these problems. However, register files are now so highly optimized that evaluating an alternative requires the complete layout of an optimal design for comparison, followed by timing and power analysis; a functional comparison of abstract logic is no longer sufficient.
This paper focuses on one unexplored option for operand storage whose structure differs from that of a register file. The questions we examine are (a) whether a LIFO (last-in-first-out) stack can support superscalar operand access and (b) how its performance compares with established mainstream approaches. This work is undertaken with a 90 nm UMC CMOS process library; however, we ultimately use FO4 as the delay metric [7] in order to provide a general measure of performance that can be scaled to other process nodes. The work is undertaken using standard cell digital libraries and not at the transistor level. Although this is therefore not an optimal solution, it permits rapid assessment of multiple
References
[1]
International Technology Roadmap for Semiconductors (ITRS), http://www.itrs.net/links/2011itrs/home2011.htm.
[2]
G. Venkatesh, J. Sampson, N. Goulding-Hotta, S. K. Venkata, M. B. Taylor, and S. Swanson, “QsCores: trading dark silicon for scalable energy efficiency with quasi-specific cores,” in Proceedings of the 44th Annual IEEE/ACM Symposium on Microarchitecture (MICRO '44), pp. 163–174, ACM, December 2011.
[3]
R. J. Ribando and K. Skadron, “Many-core design from a thermal perspective,” in Proceedings of the 45th Design Automation Conference (DAC '08), pp. 746–749, Anaheim, Calif, USA, June 2008.
[4]
D. Sylvester and H. Kaul, “Future performance challenges in nanometer design,” in Proceedings of the 38th Design Automation Conference, pp. 3–8, ACM, June 2001.
[5]
R. Ho, K. W. Mai, and M. A. Horowitz, “The future of wires,” Proceedings of the IEEE, vol. 89, no. 4, pp. 490–504, 2001.
[6]
H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11), pp. 365–376, IEEE, 2011.
[7]
I. E. Sutherland, R. F. Sproull, and D. F. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann, 1999.
[8]
P. Koopman, “A preliminary exploration of optimized stack code generation,” in Proceedings of the Rochester Forth Conference, Rochester, NY, USA, 1992.
[9]
C. Bailey, “Inter-boundary scheduling of stack operands: a preliminary study,” in Proceedings of the EuroForth Conference, pp. 3–11, 2000.
[10]
M. Shannon and C. Bailey, “Global stack allocation: register allocation for stack machines,” in Proceedings of the Euroforth Conference, 2006.
[11]
C. Bailey and M. Weeks, “An experimental investigation of single and multiple issue ILP speedup for stack-based code,” in Proceedings of the EuroForth Conference, pp. 19–24, 2000.
[12]
US Patent 6148391: System for Simultaneously Accessing one or More Stack Elements by multiple functional units, and related US patent 6026485: Instruction Folding for A Stack-Machine.
[13]
C. Bailey, “A proposed mechanism for super-pipelined instruction-issue for ILP stack machines,” in Proceedings of the EUROMICRO Systems on Digital System Design (DSD '04), pp. 121–129, IEEE, September 2004.
[14]
C. Bailey and H. Shi, “Instruction level parallelism of stack-code under varied issue widths, and one-level branch prediction,” in Proceedings of the IADIS International Conference on Applied Computing (AC '05), pp. 23–30, Algarve, Portugal, February 2005.
[15]
C. Bailey, R. Sotudeh, and M. Ould-Khaoua, “The effects of local variable optimisation in A C-based stack processor environment,” in Proceedings of the 1994 Euroforth Conference, 1994.
[16]
T. J. Stanley and R. G. Wedig, “A performance analysis of automatically managed top of stack buffers,” in Proceedings of the 14th Annual International Symposium on Computer Architecture (ISCA '87), pp. 272–281, ACM, 1987.
[17]
C. Bailey, “A proposed mechanism for super-pipelined instruction-issue for ILP stack machines,” in Proceedings of the Euromicro Symposium on Digital System Design (DSD '04), pp. 121–129, IEEE, 2004.
[18]
C. Jesshope, “Microthreading a model for distributed instruction-level concurrency,” Parallel Processing Letters, vol. 16, no. 2, pp. 209–228, 2006.
[19]
S. Galal and M. Horowitz, “Energy-efficient floating-point unit design,” IEEE Transactions on Computers, vol. 60, no. 7, pp. 913–922, 2011.
[20]
N. Burgess, “Logical Effort analysis of multi-port register file architectures,” in Proceedings of the Conference Record of the 37th Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 887–891, IEEE, November 2003.
[21]
M. Golden and H. Partovi, “500 MHz, write-bypassed, 88-entry, 90-bit register file,” in Proceedings of the Symposium on VLSI Circuits, pp. 105–108, IEEE, June 1999.
[22]
R. K. Krishnamurthy, A. Alvandpour, G. Balamurugan, N. R. Shanbhag, K. Soumyanath, and S. Y. Borkar, “A 130-nm 6-GHz 256 × 32 bit leakage-tolerant register file,” IEEE Journal of Solid-State Circuits, vol. 37, no. 5, pp. 624–632, 2002.
[23]
R. L. Franch, J. Ji, and C. L. Chen, “A 640-ps, 0.25-μm CMOS, -b three-port register file,” IEEE Journal of Solid-State Circuits, vol. 32, no. 8, pp. 1288–1292, 1997.
[24]
O. Takahashi, J. Silberman, S. Dhong, P. Hofstee, and N. Aoki, “690ps read-access latency register file for a GHz integer microprocessor,” in Proceedings of the 1998 IEEE International Conference on Computer Design, pp. 6–10, Austin, Tex, USA, October 1998.
[25]
W. Hwang, R. V. Joshi, and W. H. Henkels, “A 500-MHz, 32-word x 64-bit, eight-port self-resetting CMOS register file,” IEEE Journal of Solid-State Circuits, vol. 34, no. 1, pp. 56–67, 1999.
[26]
C. H. Hua and W. Hwang, “Low power multiple access port register file design in 100 nm CMOS technology,” in Proceedings of the 14th VLSI/CAD Symposium, Hualien, Taiwan, August 2003.
[27]
R. V. Joshi, W. Hwang, S. C. Wilson, and C. T. Chuang, “‘Cool low power’ 1 GHz multi-port register file and dynamic latch in 1.8 V, 0.25 μm SOI and bulk technology,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '00), pp. 203–206, July 2000.
[28]
M. Kondo and H. Nakamura, “A small, fast and low-power register file by bit-partitioning,” in Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA-11 '05), pp. 40–49, IEEE, February 2005.
[29]
R. D. Jolly, “A 9-ns, 1.4-gigabyte/s, 17-ported CMOS register file,” IEEE Journal of Solid-State Circuits, vol. 26, no. 10, pp. 1407–1412, 1991.
[30]
C. Asato, “A 14-port 3.8-ns 116-word 64-b read-renaming register file,” IEEE Journal of Solid-State Circuits, vol. 30, no. 11, pp. 1254–1258, 1995.
[31]
N. Tzartzanis, W. W. Walker, H. Nguyen, and A. Inoue, “A 34word × 64b 10R/6W write-through self-timed dual-supply-voltage register file,” in Proceedings of the IEEE International Solid-State Circuits Conference, Digest of Technical Papers (ISSCC '02), vol. 2, pp. 338–537, San Francisco, Calif, USA, February 2002.
[32]
N. S. Kim and T. Mudge, “The microarchitecture of a low power register file,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '03), pp. 384–389, August 2003.
[33]
V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, “Clock rate versus IPC: the end of the road for conventional microarchitectures,” in Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA '00), vol. 28, pp. 248–259, ACM, New York, NY, USA, 2000.
[34]
K. Puttaswamy and G. H. Loh, “Implementing register files for high-performance microprocessors in a die-stacked (3D) technology,” in Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, IEEE, Karlsruhe, Germany, March 2006.
[35]
R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi, “Reducing the complexity of the register file in dynamic superscalar processors,” in Proceedings of the 34th ACM/IEEE Annual International Symposium on Microarchitecture (MICRO '01), pp. 237–248, December 2001.
[36]
J. Cruz, A. Gonzalez, M. Valero, and N. P. Topham, “Multibanked register file architectures,” in Proceedings of the 27th International Symposium on Computer Architecture (ISCA '00), Vancouver, Canada, June 2000.
[37]
K. I. Farkas, N. P. Jouppi, and P. Chow, “Register file design considerations in dynamically scheduled processors,” in Proceedings of the 2nd International Symposium on High-Performance Computer Architecture (HPCA '96), pp. 40–51, February 1996.
[38]
D. Chinnery and K. Keutzer, Closing the Gap between ASIC & Custom: Tools and Techniques for High-Performance ASIC Design, Springer, 2002.
[39]
P. Montesinos, W. Liu, and J. Torrellas, “Using register lifetime predictions to protect register files against soft errors,” in Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '07), pp. 286–295, IEEE, June 2007.
[40]
L. Jin, W. Wu, J. Yang, C. Zhang, and Y. Zhang, “Reduce register files leakage through discharging cells,” in Proceedings of the 24th International Conference on Computer Design (ICCD '06), pp. 114–119, IEEE, October 2006.
[41]
V. Nookala and S. S. Sapatnekar, “Designing optimized pipelined global interconnects: Algorithms and methodology impact,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), pp. 608–611, IEEE, May 2005.
[42]
Z. Hu and M. Martonosi, “Reducing register file power consumption by exploiting value lifetime characteristics,” Proceedings of the Workshop on Complexity-Effective Design (WCED '00), vol. 1, pp. 1829–1841, 2000.
[43]
R. Singh, G.-M. Hong, M. Kim, J. Park, W.-Y. Shin, and S. Kim, “Static-switching pulse domino: a switching-aware design technique for wide fan-in dynamic multiplexers,” Integration, the VLSI Journal, vol. 45, no. 3, pp. 253–262, 2012.