An OpenCL-Based Parallel Matrix Multiplication Algorithm on Heterogeneous Platforms

XIAO Han (肖汉); XIAO Shiyang (肖诗洋); LI Cailin (李彩林); ZHOU Qinglei (周清雷)

doi:10.13718/j.cnki.xdzk.2020.11.017
