high performance computing on graphics processing units: hgpu.org


How much can we gain from Tensor Kernel Fusion on GPUs?

Wei Sun, Ang Li, Sander Stuijk, Henk Corporaal

Electronic Systems Group, Eindhoven University of Technology, the Netherlands

IEEE Access, 2024

DOI:10.1109/ACCESS.2024.3411473

@article{sun2024much,
  title={How much can we gain from Tensor Kernel Fusion on GPUs?},
  author={Sun, Wei and Li, Ang and Stuijk, Sander and Corporaal, Henk},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}


Kernel fusion is a crucial optimization technique for GPU applications, particularly deep neural networks: it combines multiple consecutive kernels into a single larger kernel. The aim is to improve performance by reducing slow off-chip memory accesses, since intermediate results between successive kernels are kept in faster on-chip memory such as shared memory. This strategy has the potential not only to boost performance but also to reduce energy consumption. GPU kernels typically fall into two categories: tensor operations and element operations. In deep learning, fusing a tensor operation kernel with the element operation kernel that follows it, such as combining convolution with ReLU, is common practice. Combining two tensor kernels in a single GPU kernel has shown benefits in certain applications, but it is not a straightforward task. The advantages and limitations of this approach remain unclear, prompting several questions: 1) What advantages does tensor kernel fusion offer on GPGPUs? 2) What limitations does it have, and why is it not widely adopted? 3) In what practical scenarios is tensor kernel fusion beneficial? To address these questions, we conducted both analytical and experimental studies on NVIDIA Tensor Core GPUs, using the CUTLASS kernel library with extensions. Our experimental findings revealed that, for tall and narrow matrix multiplications, a 1D tiling strategy outperforms the commonly used 2D tiling strategy. By comparing tensor kernel fusion against a 1D tiling baseline, we demonstrated significant performance gains from fusion for tall and narrow matrix multiplications. However, we observed that these benefits diminish as the matrices grow wider.
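To make the off-chip traffic argument concrete, the following is a minimal back-of-the-envelope sketch in Python. It is not the paper's analytical model. It assumes a chain of two matrix multiplications, D = (A x B) x C, in which every operand is read from off-chip memory exactly once and every result is written exactly once. The function names, the default element size, and the matrix shapes are hypothetical and chosen only for illustration.

```python
def gemm_chain_traffic(m, k, n, p, bytes_per_elem=2):
    """Idealized off-chip traffic in bytes for D = (A @ B) @ C.

    A is m x k, B is k x n, C is n x p. Assumption (not from the paper):
    each operand is read once and each result is written once, with no
    tile re-reads and no cache effects.
    """
    # Unfused: kernel 1 reads A and B and writes the intermediate T (m x n);
    # kernel 2 reads T and C and writes D (m x p).
    unfused = (m * k + k * n + m * n) + (m * n + n * p + m * p)
    # Fused: T stays in on-chip memory, so its write and re-read disappear.
    fused = m * k + k * n + n * p + m * p
    return unfused * bytes_per_elem, fused * bytes_per_elem


def traffic_ratio(m, width):
    """Unfused/fused traffic ratio for a tall matrix with k = n = p = width."""
    unfused, fused = gemm_chain_traffic(m, width, width, width)
    return unfused / fused


if __name__ == "__main__":
    # Hypothetical shapes: a tall matrix (m = 16384) of increasing width.
    for width in (32, 128, 512, 2048):
        print(width, round(traffic_ratio(16384, width), 3))
```

Under these assumptions the ratio simplifies to (2m + w) / (m + w) when k = n = p = w, so it approaches 2 for tall and narrow matrices and falls toward 1 as the width grows. That trend points in the same direction as the abstract's observation, but the sketch ignores on-chip capacity limits, tiling, and compute-bound behavior, any of which may matter more in practice.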

Tags: Computer science, CUDA, Deep learning, Matrix multiplication, Neural networks, nVidia, nVidia A100, nVidia H100

June 16, 2024 by hgpu


