Performance Portable GPU Code Generation for Matrix Multiplication

Toomas Remmelg, Thibaut Lutz, Michel Steuwer, Christophe Dubach

University of Edinburgh

The 9th Workshop on General Purpose Processing using GPUs (GPGPU), 2016

@article{remmelg2016performance,
title={Performance Portable GPU Code Generation for Matrix Multiplication},
author={Remmelg, Toomas and Lutz, Thibaut and Steuwer, Michel and Dubach, Christophe},
year={2016}
}

Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in non-portable solutions that need to be re-optimized for every new device. Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem, even for well-studied applications like matrix multiplication. We argue that what is needed is a way to describe applications at a high level without committing to particular implementations. To this end, we developed in a previous paper a functional data-parallel language which allows applications to be expressed in a device-neutral way. We use a set of well-defined rewrite rules to automatically transform programs into semantically equivalent device-specific forms, from which OpenCL code is generated. In this paper, we demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication. Starting from a single high-level program, our compiler automatically generates highly optimized and specialized implementations. We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization like tiling and register blocking in a composable way. Using an exploration strategy, our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized, but provably correct, implementation of matrix multiplication. The automatically generated code offers competitive performance compared to the manually tuned MAGMA library implementations of matrix multiplication on Nvidia and even outperforms AMD’s clBLAS library.
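
To make the idea of a semantics-preserving rewrite concrete, the following is a minimal sketch in plain Python. It is not the authors' language, rewrite rules, or generated OpenCL. The function names (`dot`, `transpose`, `matmul`, `matmul_tiled`) and the `tile` parameter are hypothetical, chosen only for illustration. It shows matrix multiplication written once in a high-level map/reduce style, and a tiled variant that regroups the same work into blocks while producing the same result.

```python
# Illustrative sketch only: plain Python, not the paper's language or compiler.
# All names here are hypothetical and chosen for this example.

def dot(xs, ys):
    """Reduce (sum) over the element-wise products of two vectors."""
    return sum(x * y for x, y in zip(xs, ys))

def transpose(m):
    """Turn rows into columns."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """High-level form: for each row of a and each column of b, take a dot product."""
    bt = transpose(b)
    return [[dot(row, col) for col in bt] for row in a]

def matmul_tiled(a, b, tile=2):
    """Tiled form: the same arithmetic, grouped into tile x tile x tile blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # block of output rows
        for j0 in range(0, m, tile):        # block of output columns
            for k0 in range(0, k, tile):    # block of the reduction dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]

# Both forms compute the same product; only the iteration order differs.
assert matmul(A, B) == matmul_tiled(A, B, tile=2)
```

The example uses integer inputs so that the two forms agree exactly; with floating-point values, reordering the reduction can change rounding, so a comparison would need a tolerance.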

Tags: ATI, ATI Radeon HD 7970, Code generation, Compilers, Computer science, Matrix multiplication, nVidia, nVidia GeForce GTX 480, nVidia GeForce GTX Titan Black, OpenCL

February 10, 2016 by hgpu
