Over the past decade, Graphics Processing Units (GPUs) have transformed high-performance computing, playing a pivotal role in fields such as the Internet of Things (IoT), autonomous vehicles, and exascale computing. Despite these advances, programming GPUs efficiently remains a daunting challenge that often relies on trial-and-error optimization. This paper introduces an optimization technique for CUDA programs based on a novel data layout strategy that restructures the arrangement of data in memory to significantly improve data access locality. We focus on the dynamic programming algorithm for chained matrix multiplication, an operation central to artificial intelligence (AI), high-performance computing (HPC), and IoT workloads, and show how the restructured layout yields more localized memory access. We further illustrate why efficient matrix multiplication matters in these areas, underscoring the technique's broader applicability and its potential to address some of the most pressing computational challenges in GPU-accelerated applications. Our findings show a marked reduction in memory consumption and a 50% decrease in execution time for CUDA programs that adopt the technique, setting a new benchmark for optimization in GPU computing.
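To make the target computation concrete, the following is a minimal host-side C sketch of the classic dynamic programming algorithm for chained matrix multiplication, with the cost table stored as a flattened one-dimensional array. The flattened row-major layout is an illustrative assumption of the kind of GPU-friendly data arrangement the abstract refers to, not necessarily the paper's exact layout scheme; the function and variable names are likewise hypothetical.

```c
#include <stdio.h>
#include <limits.h>

/* Matrix-chain DP: dims[] holds n+1 values; matrix i is dims[i] x dims[i+1].
 * The n-by-n cost table m is flattened into a single row-major 1D array,
 * a common layout choice when preparing a DP table for GPU kernels
 * (illustrative sketch only; not the paper's specific scheme). */
unsigned long matrix_chain_cost(const int *dims, int n, unsigned long *m)
{
    for (int i = 0; i < n; i++)
        m[i * n + i] = 0;                 /* a single matrix costs nothing */

    for (int len = 2; len <= n; len++) {  /* chain length being solved */
        for (int i = 0; i + len - 1 < n; i++) {
            int j = i + len - 1;
            unsigned long best = ULONG_MAX;
            for (int k = i; k < j; k++) { /* split point between i..k and k+1..j */
                unsigned long cost = m[i * n + k] + m[(k + 1) * n + j]
                    + (unsigned long)dims[i] * dims[k + 1] * dims[j + 1];
                if (cost < best)
                    best = cost;
            }
            m[i * n + j] = best;          /* minimal scalar multiplications */
        }
    }
    return m[0 * n + (n - 1)];            /* cost of the full chain */
}
```

For example, for three matrices of dimensions 10x20, 20x30, and 30x40 (`dims = {10, 20, 30, 40}`), the optimal parenthesization ((A1 A2) A3) costs 18000 scalar multiplications. Because cells on the same diagonal of the table are independent, this loop nest is the natural candidate for GPU parallelization, which is where the layout of `m` in memory governs access locality.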