Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

Matija Dodović, Milica Veselinović, Marko Mišić

School of Electrical Engineering, University of Belgrade, Bulevar Kralja Aleksandra 73, 11000 Belgrade, Serbia

Electronics, 15(5), 1034, (2026)

DOI:10.3390/electronics15051034

@article{dodovic2026analyzing,
  title={Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study},
  author={Dodovi{\'c}, Matija and Veselinovi{\'c}, Milica and Mi{\v{s}}i{\'c}, Marko},
  journal={Electronics},
  volume={15},
  number={5},
  pages={1034},
  year={2026},
  publisher={MDPI}
}

Source code package: CUDA Kernel Fusion Benchmarks

Modern deep learning frameworks execute large numbers of small tensor kernels on GPUs, where overall performance is frequently constrained by memory bandwidth and kernel launch overhead. Systems such as TensorFlow XLA, PyTorch JIT, and cuDNN often use kernel fusion, the combining of several tensor operations into a single GPU kernel, to reduce intermediate memory transfers and improve efficiency. Nevertheless, the true performance impact of fusion is difficult to measure, both for isolated tensor operations and for end-to-end model execution. This work presents an experimental investigation of kernel fusion on three different NVIDIA GPUs. We implement fused and unfused CUDA kernels for four representative tensor operations (element-wise addition, fused multiply–add, linear transformation with ReLU activation, and map-reduce) using FP32, FP16, and mixed-precision arithmetic. We measure execution time, speedup, and effective memory bandwidth across a range of input sizes. For memory-bound and activation-heavy workloads, fusion yields consistent speedups between 1.5× and 3.13×, particularly for small and medium inputs where kernel launch overhead is significant. For operations dominated by atomic updates, the benefit is limited to between 1.01× and 1.44×. When the reduction strategy is reformulated to use block-level shared-memory aggregation, kernel fusion becomes effective again, achieving speedups of up to 2× by eliminating global synchronization bottlenecks. We further evaluate the effect of fusion on image classification models using PyTorch 2.10.0 JIT, where it yields 1.54× to 1.83× faster inference. Our results provide practical guidelines on when kernel fusion is most effective.
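To make the idea of reduced intermediate memory transfers concrete, the sketch below counts global-memory traffic for a multiply–add, out = a * b + c, executed either as two separate kernels or as one fused kernel. It is only a back-of-the-envelope model, not the paper's benchmark code: the function names, the access counts, and the assumption that every operand is read from or written to global memory exactly once (no caching, no launch overhead) are all illustrative assumptions.

```python
# Hypothetical traffic model for out = a * b + c over n elements.
# All names and access counts are illustrative assumptions, not taken
# from the paper or its benchmark package.

BYTES_PER_ELEM = {"fp32": 4, "fp16": 2}


def traffic_bytes(n, dtype, fused):
    """Global-memory bytes moved, assuming each array is touched once per kernel."""
    size = BYTES_PER_ELEM[dtype]
    if fused:
        # One kernel: read a, b, c and write out.
        accesses = 4
    else:
        # Kernel 1: read a, b and write tmp.
        # Kernel 2: read tmp, c and write out.
        accesses = 6
    return accesses * n * size


def effective_bandwidth_gbs(bytes_moved, seconds):
    """Effective bandwidth in GB/s: bytes moved divided by elapsed time."""
    return bytes_moved / seconds / 1e9


def traffic_speedup_bound(n, dtype):
    """Speedup expected from traffic reduction alone for a memory-bound kernel."""
    unfused = traffic_bytes(n, dtype, fused=False)
    fused = traffic_bytes(n, dtype, fused=True)
    return unfused / fused


if __name__ == "__main__":
    n = 1 << 20  # about one million elements
    print("unfused bytes:", traffic_bytes(n, "fp32", fused=False))
    print("fused bytes:  ", traffic_bytes(n, "fp32", fused=True))
    print("traffic-only speedup bound:", traffic_speedup_bound(n, "fp32"))
```

Under these assumptions the fused version moves 4 arrays' worth of data instead of 6, so traffic alone would account for a 1.5× gain on a purely memory-bound kernel. Measured speedups can differ in either direction, since the model ignores kernel launch overhead, caching, and compute cost.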

Tags: Computer science, CUDA, Deep learning, nVidia, nVidia GeForce RTX 3080, Package

May 20, 2026 by hgpu