Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation


Michel Steuwer, Toomas Remmelg, Christophe Dubach

University of Edinburgh

International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES ’16), 2016

@article{steuwer2016matrix,
  title={Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation},
  author={Steuwer, Michel and Remmelg, Toomas and Dubach, Christophe},
  year={2016}
}


Graphics Processing Units (GPUs) are used as general-purpose parallel accelerators in a wide range of applications. They are found in most computing systems, and mobile devices are no exception. The recent availability of programming APIs such as OpenCL for mobile GPUs promises to open up new types of applications on these devices. However, producing high-performance GPU code is extremely difficult. Subtle differences in device characteristics can lead to large performance variations when different optimizations are applied. As we will see, this is especially true for a mobile GPU such as the ARM Mali GPU, which has a very different architecture from desktop-class GPUs. Code optimized and tuned for one type of GPU is unlikely to achieve its performance potential on another type of GPU. Auto-tuners have traditionally been an answer to this performance portability challenge. For instance, they have been successful on CPUs for matrix operations, which are used as building blocks in many high-performance applications. However, they are much harder to design for different classes of GPUs, given the wide variety of hardware characteristics. In this paper, we take a different perspective and show how performance portability for matrix multiplication is achieved using a compiler approach. This approach is based on a recently developed generic technique that combines a high-level programming model with a system of rewrite rules. Programs are automatically rewritten in successive steps, where optimization decisions are made. This approach is truly performance portable, resulting in high-performance code for very different types of architectures such as desktop and mobile GPUs. In particular, we achieve a speedup of 1.7x over a state-of-the-art auto-tuner on the ARM Mali GPU.
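To make the idea of rewriting a high-level program in successive steps more concrete, here is a minimal Python sketch. It is not the authors' system and it generates no OpenCL code. It only shows, under stated assumptions, what one semantics-preserving rewrite could look like when matrix multiplication is written with map-style primitives. The helper names (`split`, `join`, `matmul`, `matmul_tiled_rows`) and the specific rule illustrated, rewriting a map over rows into a map over tiles of rows, are assumptions chosen for illustration and are not taken from the abstract.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def split(n: int, xs: Sequence) -> list:
    """Cut xs into consecutive chunks of length n (hypothetical helper)."""
    if n <= 0 or len(xs) % n != 0:
        raise ValueError("chunk size must be positive and divide the input length")
    return [list(xs[i:i + n]) for i in range(0, len(xs), n)]


def join(xss: Sequence[Sequence]) -> list:
    """Flatten one level of nesting; the inverse of split (hypothetical helper)."""
    return [x for xs in xss for x in xs]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """High-level form: map over rows of a, map over columns of b, reduce a zip."""
    cols = list(zip(*b))  # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def matmul_tiled_rows(a: Matrix, b: Matrix, n: int) -> Matrix:
    """Rewritten form: the outer map over rows becomes join(map(map, split(n))).

    The tile size n is a free parameter; choosing it is the kind of
    optimization decision a rewrite-based compiler would make per device.
    """
    return join([matmul(tile, b) for tile in split(n, a)])
```

Both functions compute the same result for any tile size that divides the number of rows, for example `matmul_tiled_rows(a, b, 2) == matmul(a, b)` for a matrix `a` with four rows. The point of the sketch is that the rewritten program exposes a structural choice (the tile size) without changing what is computed.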

Tags: ARM, ATI, ATI Radeon HD 7970, BLAS, Code generation, Computer science, Linear Algebra, Matrix multiplication, nVidia, nVidia GeForce GTX Titan Black, OpenCL, performance portability

July 8, 2016 by hgpu
