High-Performance Tensor Contractions for GPUs
Innovative Computing Laboratory, University of Tennessee, Knoxville, TN, USA
University of Tennessee Computer Science Technical Report, no. UT-EECS-16-738, 2016
@article{abdelfattah2016high,
  title={High-Performance Tensor Contractions for GPUs},
  author={Abdelfattah, A. and Baboulin, M. and Dobrev, V. and Dongarra, J. and Earl, C. and Falcou, J. and Haidar, A. and Karlin, I. and Kolev, Tz. and Masliah, I. and others},
  year={2016}
}
We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain with existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, by using our framework to batch contractions and exploiting application-specific knowledge, we demonstrate close-to-peak performance. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which are the main focus and motivation for this work, we represent contractions as tensor index reorderings plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving many-fold algorithmic acceleration, owing to the reuse of data loaded into fast memory. In addition to using this context knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations that achieve over 90% of a theoretically derived peak on GPUs. On a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8, for example, we are 2.8x faster than cuBLAS and 8.5x faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60 GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and to provide an architecture-aware, user-friendly interface.
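To make the batched-GEMM formulation concrete, the following is a minimal sketch, not the paper's own implementation, of dispatching many small independent contractions of the form C_i = A_i * B_i as a single batched GEMM through cuBLAS's cublasDgemmBatched (the baseline the abstract compares against). The matrix size 8 comes from the benchmark above; the batch count, buffer layout, and variable names are illustrative assumptions.

// Sketch: one batched-GEMM launch for many tiny contractions (C_i = A_i * B_i).
// Assumes column-major n-by-n matrices packed contiguously; batch count is illustrative.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 8;          // GEMM size from the abstract's benchmark
    const int batch = 10000;  // number of independent contractions (assumed)
    const size_t elems = (size_t)n * n * batch;

    // One contiguous slab per operand; matrix i starts at offset i*n*n.
    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, elems * sizeof(double));
    cudaMalloc((void**)&dB, elems * sizeof(double));
    cudaMalloc((void**)&dC, elems * sizeof(double));

    // The batched interface takes device arrays of per-matrix pointers.
    std::vector<const double*> hA(batch), hB(batch);
    std::vector<double*> hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + (size_t)i * n * n;
        hB[i] = dB + (size_t)i * n * n;
        hC[i] = dC + (size_t)i * n * n;
    }
    const double **dAarr; const double **dBarr; double **dCarr;
    cudaMalloc((void**)&dAarr, batch * sizeof(double*));
    cudaMalloc((void**)&dBarr, batch * sizeof(double*));
    cudaMalloc((void**)&dCarr, batch * sizeof(double*));
    cudaMemcpy(dAarr, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dBarr, hB.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dCarr, hC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    // C_i = A_i * B_i for the whole batch in one call, instead of 'batch'
    // tiny GEMM launches; this is what makes sub-warp problem sizes viable.
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, dAarr, n, dBarr, n, &beta, dCarr, n, batch);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFree(dAarr); cudaFree(dBarr); cudaFree(dCarr);
    return 0;
}

The paper's framework goes beyond this baseline: the tensor index reordering that maps a general contraction onto such GEMM batches, the data structures and interfaces, and the custom kernels that beat cuBLAS at these sizes are its contributions and are not shown here.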
January 29, 2016 by hgpu