Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units
Graduate School of Computer Science and Engineering, The University of Aizu, Tsuruga, Ikki-Machi, Aizu-Wakamatsu City, Fukushima, 965-8580 Japan
Technical Report 2014-001, 2014
@techreport{matsumoto2014implementing,
title={Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units},
author={Matsumoto, Kazuya and Nakasato, Naohito and Sedukhin, Stanislav},
institution={The University of Aizu},
number={2014-001},
year={2014}
}
This paper presents an implementation of different matrix-matrix multiplication routines in OpenCL. We utilize the high-performance GEMM (GEneral Matrix-Matrix Multiply) implementation from our previous work for the present implementation of other matrix-matrix multiply routines in Level-3 BLAS (Basic Linear Algebra Subprograms). The other routines include SYMM (Symmetric Matrix-Matrix Multiply), SYRK (Symmetric Rank-K Update), SYR2K (Symmetric Rank-2K Update), and TRMM (Triangular Matrix-Matrix Multiply). A key in our approach is to copy the given matrix data, using OpenCL copying kernels, into a form such that the high-performance GEMM kernel can be utilized for the computation. We use a previously developed auto-tuning system for the highly optimized copying kernels as well as for the GEMM kernel. The performance evaluation of our implementation is conducted on four different GPUs (AMD Radeon R9 290X, FirePro W9100, Radeon HD 7970, and NVIDIA GeForce GTX Titan), a many-core processor (Intel Xeon Phi 5110P), and a multi-core processor (Intel Core i7 3960X). The evaluation results show that the tuning of the copying kernels is effective and contributes to developing high-performance BLAS3 routines.
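The copy-then-GEMM idea can be sketched in plain Python (a hypothetical illustration, not the authors' OpenCL code): a symmetric matrix stored only in its lower triangle is first expanded by a copy step into full general form, after which an ordinary GEMM computes the SYMM product. All function names below are inventions for this sketch.

```python
# Hedged sketch of the copy-then-GEMM approach described in the abstract.
# In the paper, the copy step is a tuned OpenCL kernel; here it is a plain
# Python function operating on lists of lists.

def expand_symmetric_lower(A_lower):
    """Copy step: mirror the stored lower triangle into the upper triangle,
    producing a full general matrix that a GEMM kernel can consume."""
    n = len(A_lower)
    return [[A_lower[i][j] if i >= j else A_lower[j][i] for j in range(n)]
            for i in range(n)]

def gemm(A, B):
    """Plain triple-loop general matrix-matrix multiply: C = A * B."""
    n, m, k = len(A), len(B[0]), len(B)
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def symm_via_gemm(A_lower, B):
    """SYMM realized as a data-copy step followed by a GEMM call."""
    return gemm(expand_symmetric_lower(A_lower), B)
```

On a GPU, the copy step adds memory traffic but lets a single highly tuned GEMM kernel serve all the BLAS3 routines, which is the trade-off the paper's auto-tuning targets.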
October 29, 2014 by hgpu