high performance computing on graphics processing units: hgpu.org

Autotuning Tensor Contraction Computations on GPUs

Axel Rivera, Mary Hall, Paul D. Hovland, Elizabeth Jessup, Thomas Nelson, Boyana Norris

School of Computing, University of Utah, Salt Lake City, UT

University of Utah, 2014

We describe a framework for generating optimized GPU code for computing tensor contractions, a multidimensional generalization of matrix-matrix multiplication that arises frequently in computational science applications. Typical performance optimization strategies for such computations transform the tensors into sequences of matrix-matrix multiplications to take advantage of an optimized BLAS library, but this approach is not appropriate for small tensors. We instead develop an autotuning strategy that generates CUDA variants from a sequential implementation and identifies the best-performing variant. We compare our generated code with that of OpenACC when offloading the same computation to the GPU. The straightforward OpenACC implementation is as much as 23X slower than our automatically generated code for benchmarks representative of two large-scale tensor contraction computations, Nek5000 and NWChem. However, we show how changes in GPU thread-block decomposition and register placement of data in the OpenACC annotations can achieve comparable performance to our automatically generated code. This result highlights limitations of the OpenACC compiler in targeting GPUs for computations such as tensor contractions with small trip counts and large dimensionality. It also suggests additional optimizations that can overcome these limitations.
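To make the two central ideas concrete, here is a minimal sketch in plain Python. It is not the authors' framework and generates no CUDA: the contraction shape `C[i][j][k] = sum_l A[l][i] * B[l][j][k]`, the function names `contract` and `autotune`, and the use of loop orders as the "variants" are all assumptions chosen for illustration. It shows only what a small tensor contraction looks like as a sequential loop nest and how an autotuner can time several equivalent variants and keep the fastest.

```python
import itertools
import time


def contract(A, B, order=("i", "j", "k", "l")):
    """Compute C[i][j][k] = sum_l A[l][i] * B[l][j][k] using the given loop order.

    The contraction shape is a hypothetical example, not one taken from the paper.
    """
    n_l, n_i = len(A), len(A[0])
    n_j, n_k = len(B[0]), len(B[0][0])
    extent = {"i": n_i, "j": n_j, "k": n_k, "l": n_l}
    C = [[[0.0] * n_k for _ in range(n_j)] for _ in range(n_i)]
    # Every permutation of the four loops gives the same result, because each
    # iteration only accumulates into one element of C.
    for idx in itertools.product(*(range(extent[name]) for name in order)):
        pos = dict(zip(order, idx))
        i, j, k, l = pos["i"], pos["j"], pos["k"], pos["l"]
        C[i][j][k] += A[l][i] * B[l][j][k]
    return C


def autotune(A, B, repeats=3):
    """Time every loop-order variant and return the fastest one with its time."""
    best_order, best_time = None, float("inf")
    for order in itertools.permutations("ijkl"):
        start = time.perf_counter()
        for _ in range(repeats):
            contract(A, B, order)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_order, best_time = order, elapsed
    return best_order, best_time


# Small extents, mirroring the "small trip counts" regime the abstract mentions.
n = 4
A = [[float(l + i) for i in range(n)] for l in range(n)]
B = [[[float(l * j - k) for k in range(n)] for j in range(n)] for l in range(n)]

reference = contract(A, B)
best_order, best_time = autotune(A, B)
print("fastest loop order:", "".join(best_order))
```

In a real GPU setting the search space would cover choices such as thread-block decomposition and register placement rather than Python loop orders, but the select-by-measurement structure is the same.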

Tags: Benchmarking, Code generation, Computer science, CUDA, Matrix multiplication, nVidia, OpenACC, Tesla C2050, Tesla K20

June 19, 2015 by hgpu
