
Acta Electronica Sinica (电子学报), 2015

A Survey of Heterogeneous Many-Core Systems: Programming Models and Performance Optimization Techniques

DOI: 10.3969/j.issn.0372-2112.2015.01.018, PP. 111-119

Keywords: heterogeneous many-core systems, high-performance computing, heterogeneous computing, programming models, performance optimization


Abstract:

Heterogeneous many-core systems have become a major trend in high-performance computing. This survey compares current heterogeneous systems along three dimensions (architecture, programming, and supported applications), revealing the development trend of heterogeneous systems and their advantages over traditional multi-core parallel systems. It then analyzes the open problems and challenges in programming models and performance optimization, reviews the state of research in China and abroad, and, in light of the remaining problems and difficulties, discusses promising directions for further research in this field. Benchmark tests covering different application types are run on two typical heterogeneous many-core systems, CPU+GPU and CPU+MIC, verifying the distinct application characteristics of each and providing users with guidance for choosing between them. On this basis, the paper proposes combining the two many-core processors (GPU and MIC) within a single compute node to form a new hybrid heterogeneous system. Such a system can exploit the different processing strengths of the two accelerators to cooperatively handle complex applications with diverse characteristics; the key problems that must be studied and solved for this hybrid system are also analyzed. Finally, the challenges facing heterogeneous many-core systems and directions for future research are summarized.
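The proposed hybrid node rests on the idea that workloads with different characteristics should each run on the accelerator that suits them best. As a minimal sketch of that scheduling idea (not the authors' implementation; the task fields, thresholds, and device labels below are hypothetical), a dispatcher might route tasks by their arithmetic intensity and control-flow divergence:

```python
from dataclasses import dataclass


# Hypothetical task descriptor: the fields and thresholds are illustrative
# assumptions, not taken from the surveyed paper.
@dataclass
class Task:
    name: str
    arithmetic_intensity: float  # FLOPs per byte of data moved
    branch_divergence: float     # 0.0 (uniform control flow) .. 1.0 (highly divergent)


def choose_accelerator(task: Task) -> str:
    """Route a task to one accelerator in a hybrid GPU+MIC node.

    Rough heuristic only: GPUs favor regular, compute-dense data
    parallelism, while a MIC's general-purpose cores tolerate
    divergent control flow better.
    """
    if task.branch_divergence > 0.5:
        return "MIC"
    if task.arithmetic_intensity >= 4.0:
        return "GPU"
    return "MIC"


if __name__ == "__main__":
    workload = [
        Task("dense-matmul", arithmetic_intensity=16.0, branch_divergence=0.1),
        Task("graph-traversal", arithmetic_intensity=0.5, branch_divergence=0.8),
        Task("spmv", arithmetic_intensity=0.8, branch_divergence=0.2),
    ]
    for t in workload:
        print(f"{t.name} -> {choose_accelerator(t)}")
```

In a real system the routing decision would come from measured benchmark profiles, as in the paper's CPU+GPU versus CPU+MIC experiments, rather than from fixed thresholds.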


