Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU
Graduate School of Computer Science and Engineering, The University of Aizu, Tsuruga, Ikki-Machi, Aizu-Wakamatsu City, Fukushima, 965-8580 Japan
6th IEEE International Symposium on Embedded Multicore SoCs (MCSoC-12), 2012
@article{matsumoto2012implementing,
title={Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU},
author={Matsumoto, K. and Nakasato, N. and Sedukhin, S.G.},
year={2012}
}
This paper presents the results of implementing a code generator for fast general matrix multiply (GEMM) kernels. Given a set of parameters, the code generator produces the corresponding GEMM kernel written in OpenCL. The produced kernels are optimized for high performance on AMD GPUs. Access latency to GPU global memory is the main obstacle to high performance. This study shows that storing matrix data in a block-major layout increases both the performance and the performance stability of GEMM kernels. On the Tahiti GPU (Radeon HD 7970), our DGEMM (double-precision GEMM) and SGEMM (single-precision GEMM) kernels achieve up to 848 GFlop/s (90% of the peak) and 2646 GFlop/s (70%), respectively.
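To make the block-major layout mentioned in the abstract concrete, the following is a minimal sketch of one common way to define it: the matrix is cut into square blocks, the blocks are stored one after another, and the elements inside each block are stored row by row. The abstract does not specify the exact layout or block sizes the authors use, so the function names (`to_block_major`, `block_major_index`), the square block shape, and the row-by-row ordering are all assumptions made for illustration.

```python
# Illustrative sketch only: the layout details below are assumptions,
# not the layout defined in the paper.

def to_block_major(a, rows, cols, bs):
    """Reorder a row-major list `a` (rows x cols) into block-major order.

    Blocks of size bs x bs are stored one after another (block rows first),
    and the elements inside each block are stored row by row.
    """
    if rows % bs or cols % bs:
        raise ValueError("rows and cols must be multiples of the block size")
    out = []
    for bi in range(0, rows, bs):          # top row of each block
        for bj in range(0, cols, bs):      # left column of each block
            for i in range(bi, bi + bs):   # rows inside the current block
                start = i * cols + bj
                out.extend(a[start:start + bs])
    return out


def block_major_index(i, j, cols, bs):
    """Offset of element (i, j) in the array built by to_block_major."""
    blocks_per_row = cols // bs
    block_id = (i // bs) * blocks_per_row + (j // bs)
    return block_id * bs * bs + (i % bs) * bs + (j % bs)
```

With this ordering, all elements of one block occupy a contiguous range of memory, so a kernel that works on one block at a time reads consecutive addresses instead of striding across full matrix rows.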
July 15, 2012 by hgpu