high performance computing on graphics processing units: hgpu.org


How much can we gain from Tensor Kernel Fusion on GPUs?

Wei Sun, Ang Li, Sander Stuijk, Henk Corporaal

Electronic System Group, Eindhoven University of Technology, the Netherlands

IEEE Access, 2024

DOI:10.1109/ACCESS.2024.3411473

@article{sun2024much,
  title={How much can we gain from Tensor Kernel Fusion on GPUs?},
  author={Sun, Wei and Li, Ang and Stuijk, Sander and Corporaal, Henk},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}


Kernel fusion is a crucial optimization technique for GPU applications, particularly deep neural networks: multiple consecutive kernels are combined into a single larger kernel. The aim is to enhance performance by reducing slow off-chip memory accesses, since intermediate results passed between successive kernels are kept in faster on-chip memory such as shared memory. This strategy has the potential not only to boost performance but also to reduce energy consumption. GPU kernels typically fall into two categories: tensor operations and element operations. In deep learning, fusing a tensor operation kernel with the element operation kernel that follows it, such as combining convolution with ReLU, is a common way to improve performance. Combining two tensor kernels in a single GPU kernel has shown benefits in certain applications, but it is not a straightforward task. The advantages and limitations of this approach remain unclear, prompting several questions: 1) What advantages does tensor kernel fusion offer on GPGPUs? 2) What limitations does it have, and why is it not widely adopted? 3) In what practical scenarios is tensor kernel fusion beneficial? To address these questions, we conducted both analytical and experimental studies on NVIDIA Tensor Core GPUs, using the CUTLASS kernel library with extensions. Our experimental findings revealed that, for tall and narrow matrix multiplications, a 1D tiling strategy outperforms the commonly used 2D tiling strategy. By comparing tensor kernel fusion against a 1D tiling baseline, we demonstrated significant performance gains from fusion for tall and narrow matrix multiplications. However, we observed that these benefits diminish as the matrices increase in width.
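The idea of fusing two tensor kernels with 1D tiling can be illustrated with a small CPU-side sketch. This is not the authors' CUTLASS implementation: it is plain Python, the function names (`matmul`, `unfused`, `fused_1d`, `intermediate_traffic_bytes`) are hypothetical, and the default `elem_bytes=2` assumes FP16 data. It computes `D = (A x B0) x B1` for a tall matrix `A`, once with a fully materialized intermediate and once tile by tile along the rows, and it estimates the off-chip traffic that the unfused version spends on the intermediate.

```python
"""Illustrative sketch only: fusing two back-to-back matrix multiplications
with 1D (row-wise) tiling. Names and parameters are assumptions made for this
example and do not come from the paper or from CUTLASS."""


def matmul(a, b):
    """Plain matrix product of two row-major lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def unfused(a, b0, b1):
    """Two separate 'kernels': the whole M x N0 intermediate is materialized."""
    c = matmul(a, b0)      # on a GPU this would be written to off-chip memory
    return matmul(c, b1)   # and read back by the second kernel


def fused_1d(a, b0, b1, tile_m=4):
    """One 'kernel': each row tile keeps its tile_m x N0 intermediate local."""
    d = []
    for start in range(0, len(a), tile_m):
        a_tile = a[start:start + tile_m]
        c_tile = matmul(a_tile, b0)   # stands in for on-chip storage
        d.extend(matmul(c_tile, b1))  # consumed immediately, never stored
    return d


def intermediate_traffic_bytes(m, n0, elem_bytes=2):
    """Off-chip bytes spent on the intermediate when the kernels are not fused:
    one write plus one read of an M x N0 matrix."""
    return 2 * m * n0 * elem_bytes


if __name__ == "__main__":
    # Small integer matrices so both paths can be compared exactly.
    a = [[(i * 7 + j * 3) % 5 - 2 for j in range(6)] for i in range(10)]
    b0 = [[(i + 2 * j) % 4 - 1 for j in range(3)] for i in range(6)]
    b1 = [[(3 * i + j) % 3 - 1 for j in range(2)] for i in range(3)]
    print(fused_1d(a, b0, b1) == unfused(a, b0, b1))
    print(intermediate_traffic_bytes(m=1024, n0=64))
```

In this sketch each row tile needs its full `tile_m x N0` intermediate before the second product can start, so the local storage per tile grows with the width `N0`. That is one plausible reading of why the abstract reports diminishing benefits for wider matrices; it is an interpretation, not a result taken from the paper.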

Tags: Computer science, CUDA, Deep learning, Matrix multiplication, Neural networks, nVidia, nVidia A100, nVidia H100

June 16, 2024 by hgpu


