- 2016
A Thread Grouping and Mapping Method for Many-Core Systems
Abstract:
To map application threads more reasonably onto the processing cores of a many-core processor, a thread grouping and mapping method is proposed that exploits the data locality within each thread and the data correlation between threads, together with the characteristics of the target hardware architecture. Intra-thread data locality is analyzed by computing data reuse distances, and inter-thread data correlation is quantified with a thread affinity matrix. Based on the data correlation of the application and the architecture of the many-core processor, an algorithm that generates data-affinity subtrees divides the application threads into logical groups that reflect their data access behavior. On top of this logical grouping, threads are bound to processing cores so that each thread is mapped to a suitable hardware thread of a specific core. Experimental results show that, compared with a traditional mapping method, the proposed method improves computing performance by 14% on average and reduces energy consumption by 12% without introducing additional runtime overhead. By mapping threads to cores according to the data correlation between threads, the method improves the computing efficiency of many-core systems.
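The three analysis steps named in the abstract (reuse distance, thread affinity matrix, logical grouping) can be illustrated with a minimal Python sketch. This is not the paper's implementation. The function names, the use of shared-address counts as the affinity measure, and the greedy grouping rule are all assumptions made for illustration; in particular, the greedy rule is a simple stand-in and not the affinity-subtree algorithm the paper describes.

```python
from itertools import combinations


def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct addresses
    touched since the previous access to the same address.
    First-time accesses get None."""
    stack = []  # LRU stack, most recently used address last
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def affinity_matrix(thread_traces):
    """Symmetric matrix whose entry (i, j) counts the addresses that
    threads i and j both access (an assumed affinity measure)."""
    n = len(thread_traces)
    footprints = [set(trace) for trace in thread_traces]
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        shared = len(footprints[i] & footprints[j])
        matrix[i][j] = matrix[j][i] = shared
    return matrix


def group_threads(matrix, group_size):
    """Greedy stand-in for the grouping step: seed each group with the
    lowest-numbered unassigned thread, then repeatedly add the thread
    with the highest total affinity to the current group members."""
    unassigned = set(range(len(matrix)))
    groups = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        group = [seed]
        while len(group) < group_size and unassigned:
            best = max(
                sorted(unassigned),
                key=lambda t: sum(matrix[t][g] for g in group),
            )
            unassigned.remove(best)
            group.append(best)
        groups.append(group)
    return groups


# Hypothetical per-thread address traces for four threads.
traces = [[1, 2, 3], [7, 8], [2, 3, 4], [8, 9]]
affinity = affinity_matrix(traces)
groups = group_threads(affinity, group_size=2)
```

Here `group_size` would correspond to the number of hardware threads that share a cache on the target processor. The final binding step could, on Linux, use a call such as `os.sched_setaffinity` to pin each group to the cores chosen for it; that step is omitted because it depends on the platform and on the core topology.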