Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs

Dominik Ernst, Georg Hager, Jonas Thies, Gerhard Wellein
Erlangen Regional Computing Center (RRZE), 91058 Erlangen, Germany
arXiv:1905.03136 [cs.PF], (8 May 2019)


General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often perform poorly for tall & skinny matrices, which are much taller than they are wide. Nvidia’s current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance, i.e., performance at the roofline limit. We further evaluate different approaches to parallelization and thread distribution, and devise a flexible, configurable mapping scheme. A code generation approach enables an implementation that is both flexible and specialized, with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest, on an Nvidia Volta GPGPU.
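The roofline argument in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: it assumes, purely for illustration, a product of the form C = A^T B with a tall K x M matrix A and a tall K x N matrix B in double precision, where every input element is read from memory exactly once and the small M x N result causes negligible traffic. The function names and the machine numbers (peak rate and memory bandwidth) are hypothetical placeholders, not measured values for any GPU.

```python
def roofline_gflops(intensity_flop_per_byte, peak_gflops, bandwidth_gbs):
    """Roofline limit: the lower of the compute peak and bandwidth * intensity."""
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)


def tsmm_intensity(m, n, bytes_per_element=8):
    """Flop per byte for C = A^T B with A (K x m), B (K x n), K >> m, n.

    Assumes each input element is loaded once and the m x n result is
    negligible traffic, so the long dimension K cancels out.
    """
    flops_per_row = 2 * m * n                    # one multiply and one add per (i, j) pair
    bytes_per_row = bytes_per_element * (m + n)  # one row of A plus one row of B
    return flops_per_row / bytes_per_row


# Hypothetical machine numbers, chosen only for illustration.
PEAK_GFLOPS = 7000.0
BANDWIDTH_GBS = 800.0

for width in (1, 4, 16, 64):
    limit = roofline_gflops(tsmm_intensity(width, width), PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"M = N = {width:3d}: roofline limit {limit:8.1f} Gflop/s")
```

Under these assumptions the intensity for M = N is M/8 flop/byte, so narrow matrices are limited by memory bandwidth and the limit grows with the matrix width until the compute peak is reached.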
