Autotuning Tensor Contraction Computations on GPUs
School of Computing, University of Utah, Salt Lake City, UT
University of Utah, 2014
@article{rivera2014autotuning,
  title={Autotuning Tensor Contraction Computations on GPUs},
  author={Rivera, Axel and Hall, Mary and Hovland, Paul D. and Jessup, Elizabeth and Nelson, Thomas and Norris, Boyana},
  year={2014}
}
We describe a framework for generating optimized GPU code for tensor contractions, a multidimensional generalization of matrix-matrix multiplication that arises frequently in computational science applications. Typical optimization strategies for such computations reshape the tensors into sequences of matrix-matrix multiplications to exploit an optimized BLAS library, but this approach is poorly suited to small tensors. We instead develop an autotuning strategy that generates many CUDA variants from a sequential implementation and identifies the best-performing one. We compare our generated code with OpenACC code offloading the same computation to the GPU. The straightforward OpenACC implementation is as much as 23X slower than our automatically generated code on benchmarks representative of tensor contractions from two large-scale applications, Nek5000 and NWChem. However, we show how changing the GPU thread-block decomposition and the register placement of data in the OpenACC annotations achieves performance comparable to our automatically generated code. This result highlights limitations of the OpenACC compiler when targeting GPUs for computations such as tensor contractions, which have small trip counts and high dimensionality, and it suggests additional optimizations that could overcome these limitations.
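As a concrete illustration of the kind of code such a framework might produce, the sketch below shows a hand-written CUDA kernel for a small contraction C(i,j,k) = Σ_l A(l,i) · B(l,j,k). The shapes, names, and thread mapping are illustrative assumptions, not the paper's generated code; they exhibit the two variant dimensions the abstract mentions: the thread-block decomposition (here, (k, j) within a block and one block row per i) and keeping the accumulator in a register across the fully unrolled small contraction loop.

// Hypothetical contraction C(i,j,k) = sum_l A(l,i) * B(l,j,k) with a small,
// compile-time contraction extent L. All shapes, names, and the mapping of
// loop indices to threads are illustrative assumptions.
#include <cuda_runtime.h>

#define L 8  // small trip count, fully unrolled below

// Launch sketch: grid = (NI, ceil(NJ / By)), block = (NK, By);
// assumes NK is small enough to serve as the x dimension of a thread block.
__global__ void contract(const double* __restrict__ A,   // L x NI
                         const double* __restrict__ B,   // L x NJ x NK
                         double* __restrict__ C,         // NI x NJ x NK
                         int NI, int NJ, int NK)
{
    int i = blockIdx.x;                                  // one block row per i
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = threadIdx.x;                                 // innermost index: coalesced B/C accesses
    if (j >= NJ || k >= NK) return;

    double acc = 0.0;                                    // accumulator held in a register
#pragma unroll
    for (int l = 0; l < L; ++l)                          // too small to amortize a BLAS call
        acc += A[l * NI + i] * B[(l * NJ + j) * NK + k];

    C[(i * NJ + j) * NK + k] = acc;
}

An autotuner in the spirit of this work would enumerate many such variants (thread-block shapes, unroll factors, shared-memory versus register staging of the operands) and time each on the target GPU to select the best.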