Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU

Kazuya Matsumoto, Naohito Nakasato, Stanislav G. Sedukhin
Graduate School of Computer Science and Engineering, The University of Aizu, Tsuruga, Ikki-Machi, Aizu-Wakamatsu City, Fukushima, 965-8580 Japan
6th IEEE International Symposium on Embedded Multicore SoCs (MCSoC-12), 2012








This paper presents the results of implementing a code generator for fast general matrix multiply (GEMM) kernels. Given a set of parameters, the code generator produces the corresponding GEMM kernel written in OpenCL. The generated kernels are optimized for high performance on AMD GPUs. Access latency to GPU global memory is the main obstacle to high performance. This study shows that storing matrix data in a block-major layout increases the performance and stability of GEMM kernels. On the Tahiti GPU (Radeon HD 7970), our DGEMM (double-precision GEMM) and SGEMM (single-precision GEMM) kernels achieve up to 848 GFlop/s (90% of peak) and 2646 GFlop/s (70% of peak), respectively.
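To illustrate the block-major layout mentioned in the abstract, here is a minimal Python sketch that repacks a row-major matrix so that each block occupies one contiguous run of memory. The abstract does not specify the paper's exact layout, so the block ordering (blocks stored row by row, elements row-major inside each block), the function names `to_block_major` and `block_major_index`, and the requirement that the dimensions be multiples of the block size are all assumptions made for illustration.

```python
def to_block_major(a, rows, cols, bm, bn):
    """Repack a row-major matrix (flat list) into block-major order.

    Assumed layout (not taken from the paper): bm x bn blocks are stored
    one after another, block row by block row, and the elements inside
    each block are stored row by row.
    """
    if rows % bm or cols % bn:
        raise ValueError("matrix dimensions must be multiples of the block size")
    out = []
    for bi in range(0, rows, bm):          # top row of the current block
        for bj in range(0, cols, bn):      # left column of the current block
            for i in range(bi, bi + bm):   # copy the block one row at a time
                start = i * cols + bj
                out.extend(a[start:start + bn])
    return out


def block_major_index(i, j, cols, bm, bn):
    """Flat index of element (i, j) in the assumed block-major layout."""
    blocks_per_row = cols // bn
    block_id = (i // bm) * blocks_per_row + (j // bn)
    return block_id * bm * bn + (i % bm) * bn + (j % bn)


# A 4x4 matrix holding 0..15 in row-major order, repacked into 2x2 blocks.
a = list(range(16))
packed = to_block_major(a, 4, 4, 2, 2)
# packed == [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

With this layout, all elements of one block are adjacent in memory, so a kernel that processes a block can read it as one contiguous range instead of `bm` separate strided rows.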