high performance computing on graphics processing units: hgpu.org


Performance Portable GPU Code Generation for Matrix Multiplication

Toomas Remmelg, Thibaut Lutz, Michel Steuwer, Christophe Dubach

University of Edinburgh

The 9th Workshop on General Purpose Processing using GPUs (GPGPU), 2016


Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left to ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in non-portable solutions that need to be re-optimized for every new device. Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem, even for well-studied applications like matrix multiplication. We argue that what is needed is a way to describe applications at a high level without committing to particular implementations. To this end, we developed in a previous paper a functional data-parallel language which allows applications to be expressed in a device-neutral way. We use a set of well-defined rewrite rules to automatically transform programs into semantically equivalent device-specific forms, from which OpenCL code is generated. In this paper, we demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication. Starting from a single high-level program, our compiler automatically generates highly optimized and specialized implementations. We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization, such as tiling or register blocking, in a composable way. Using an exploration strategy, our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized (but provably correct) implementation of matrix multiplication. The automatically generated code offers competitive performance compared to the manually tuned MAGMA library implementations of matrix multiplication on Nvidia GPUs and even outperforms AMD's clBLAS library.
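To make the idea of "differently optimized but semantically equivalent" implementations concrete, the following is a minimal sketch in plain Python, not the paper's functional language and not its generated OpenCL. It contrasts a straightforward matrix multiplication with a tiled variant; the function names `matmul_naive` and `matmul_tiled` and the `tile` parameter are hypothetical and chosen only for illustration. The tiled version reorders the loop nest but should compute the same result for any tile size, which is the property a correctness-preserving rewrite is expected to maintain.

```python
def matmul_naive(a, b):
    """Reference version: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same computation, visited tile by tile.

    `tile` is an illustrative tuning parameter (an assumption of this
    sketch); changing it alters the traversal order, not the result.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):            # tile over rows of C
        for j0 in range(0, p, tile):        # tile over columns of C
            for k0 in range(0, m, tile):    # tile over the reduction dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = c[i][j]       # accumulate in a local variable
                        for k in range(k0, min(k0 + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_naive(a, b))
    print(matmul_tiled(a, b, tile=2))
```

Integer inputs are used so that the two versions agree exactly; with floating-point data, reordering the reduction can change rounding slightly. On a GPU, a tile would typically be staged in fast local memory and the inner accumulator kept in registers, but those mappings are outside the scope of this sketch.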

Tags: ATI, ATI Radeon HD 7970, Code generation, Compilers, Computer science, Matrix multiplication, nVidia, nVidia GeForce GTX 480, nVidia GeForce GTX Titan Black, OpenCL

February 10, 2016 by hgpu
