|
- 2016
基于矩阵三角化分解的Cholesky分解及FPGA并行结构设计
|
Abstract:
矩阵运算是高性能计算中核心问题之一,矩阵分解是提高矩阵运算并行性的重要途径,飞速发展的FPGA为并行运算结构提供了有力的环境支持。该文基于子矩阵更新同一化算法实现了Cholesky分解,基于FPGA设计了相应的并行结构。实验结果表明:与通用处理器的软件实现相比,本文实现的Cholesky分解的FPGA并行结果在核心计算性能上可以取得10倍以上的加速比,该算法针对矩阵三角化计算过程具有更高的数据和流水并行性。
Abstract:Matrix computing is one of the core problems in high performance computing with matrix decomposition being an important way to improve the parallelism of matrix computations. FPGA gives a powerful environment for parallel computing. This study uses Cholesky decomposition based on a hardware-adaptive parallel sub-matrix identity updating algorithm. Its parallel structure is based on FPGA. Tests show that this structure achieves more than 10 fold speedup compared to general-purpose processors in terms of the kernel computational speed because the algorithm has better data-parallelism and pipeline-parallelism during matrix triangularization.
[1] | Vasily V, James W D. LU, QR and Cholesky factorizations using vector capabilities of GPUs [R]. Berkeley, CA, USA: EECS Department, University of California, Berkeley, 2008. |
[2] | Aatonio R, George A. A high throughput FPGA-based floating point conjugate gradient implementation for dense matrices [J]. ACM Transactions on Reconfigurable Technology and Systems, 2010, 3(1): 1-19. |
[3] | Badia J M, Vidal A M. Parallel algorithms to computer the eigenvalues and eigenvectors of symmetric Toeplitz matrices [J]. Parallel Algorithms and Applications, 1998, 13(1): 75-93. |
[4] | Anderson E, BAI Z, Bischof C, et al. LAPACK Users Guide [M]. 3rd ED. Philadelphia, PA, USA: The society of industrial and applied mathematics, 1999. |
[5] | Blackford L S, Choi J, Cleary A, et al. ScaLAPACK Users Guide [M]. Philadelphia, PA, USA: The Society of Industrial and Applied Mathematics, 1997. |
[6] | WU Guiming, DOU Yong, SUN Junqing, et al. A high performance and memory efficient LU decomposer on FPGAs [J]. Computers, IEEE Transactions, 2012, 61(3): 366-378. |
[7] | Haridas S G, Sotirios G Z. FPGA implementation of a Cholesky algorithm for a shared memory multiprocessor architecture [J]. International Journal of Parallel, Emergent and Distributed Systems, 2004, 19(4): 211-226. |
[8] | 郭磊, 唐玉华, 周杰, 等. 基于FPGA的Cholesky分解细粒度并行结构与实现 [J]. 计算机研究与发展, 48 (Suppl), 2011: 258-265.GUO Lei, TANG Yuhua, ZHOU Jie, et al. Fine-grained architecture and implementation for Cholesky decomposition on FPGA [J]. Journal of Computer Research and Development, 2011, 48(suppl): 258-265. (in Chinese) |
[9] | 邬贵明, 窦勇, 王淼. Cholesky分解细粒度并行算法 [J]. 计算机工程与科学, 2010, 32(9): 102-106.WU Guiming, DOU Yong, WANG Miao. A fine-grained parallel algorithm for the Cholesky decomposition [J]. Computer Engineering & Science, 2010, 32(9): 102-106. (in Chinese) |
[10] | 刘书勇,吴艳霞,张博文,等. 基于可重构计算系统的矩阵三角化分解硬件并行结构研究 [J]. 电子学报, 43(8), 2015: 1642-1650. LIU Shuyong, WU Yanxia, ZHANG Bowen, et al. Research of parallel hardware architecture for matrix triangularization decomposition based on reconfigurable computing system [J]. ACTA Electronica Sinica, 2015, 43 (8): 1642-1650. (in Chinese) |
[11] | Satchidanand G H, Sotirios G Z. FPGA implementation of a Cholesky algorithm for a shared-memory multiprocessor architecture [J]. Parallel Algorithms and Applications, 2004, 19(4): 211-226. |
[12] | YANG Depeng, Gregory D P, LI Husheng. Compressed sensing and Cholesky decomposition on FPGAs and GPUs [J]. Parallel Computing, 2012, 38(8): 421-437. |
[13] | Alfredo B, Julien L, Jakub K, et al. A class of parallel tiled linear algebra algorithms for multicore architectures [J]. Parallel Computing, 2009, 35(1): 38-53. |
[14] | DOU Yong, Vassiliadis S, Kuzmanov G, et al. 64-bit floating-point FPGA matrix multiplication [C]//Proceedings of the 13th ACM/SIGDA International Symposium on Field Programmable Gate Arrays(FPGA’05), New York, NY, USA: ACM, 2005: 86-95. |
[15] | Oleg M, Volodymyr L, Anatoli S, et al. Parallel implementation of Cholesky LLT-algorithm in FPGA-based processor [J]. Parallel Processing and Applied Mathematics, 2008, 49(67): 137-147. |
[16] | Eunice E S, CHU Peiyun. Efficient and optimal parallel algorithms for Cholesky decomposition [J]. Journal of Mathematical Modeling and Algorithms, 2003, 2(3): 217-234. |
[17] | Kung H T, Leiserson C E. Systolic arrays technical report [R]. Pitsburg, PA, USA: Carneige Mellon University, 1978. |
[18] | WU Guiming, DOU Yong, Gregory D P. Blocking LU decomposition for FPGAs [C]//Proceedings of the 18th IEEE Symposium on Field-Programmable Custom Computing Machines(FCCM’10). Charlotte, NC,USA: IEEE, 2010, 109-112. |
[19] | WANG Xiaojun, Leeser M. A truly two-dimensional systolic array FPGA implementation of QR decomposition [J]. ACM Transactions on Embedded Computing Systems (TECS), 2009, 9(1): 17-18. |