Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study
School of Electrical Engineering, University of Belgrade, Bulevar Kralja Aleksandra 73, 11000 Belgrade, Serbia
Electronics, 15(5), 1034, (2026)
@article{dodovic2026analyzing,
title={Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study},
author={Dodovi{\'c}, Matija and Veselinovi{\'c}, Milica and Mi{\v{s}}i{\'c}, Marko},
journal={Electronics},
volume={15},
number={5},
pages={1034},
year={2026},
publisher={MDPI}
}
Modern deep learning frameworks execute large numbers of small tensor kernels on GPUs, where overall performance is frequently constrained by memory bandwidth and kernel launch overhead. Systems such as TensorFlow XLA, PyTorch JIT, and cuDNN often use kernel fusion, the combination of multiple tensor operations into a single GPU kernel, to reduce intermediate memory transfers and improve efficiency. Nevertheless, the true performance impact of fusion is difficult to measure, both for isolated tensor operations and for end-to-end model execution. This work presents an experimental investigation of kernel fusion on three different NVIDIA GPUs. We implement fused and unfused CUDA kernels for four sample tensor operations (element-wise addition, fused multiply–add, linear transformation with ReLU activation, and map-reduce) using FP32, FP16, and mixed-precision arithmetic. We measure execution time, speedup, and effective memory bandwidth across a range of input sizes. For memory-bound and activation-heavy workloads, fusion yields consistent speedups between 1.5× and 3.13×, particularly for small and medium inputs where kernel launch overhead is significant. For operations dominated by atomic updates, the benefit is limited to between 1.01× and 1.44×. When the reduction strategy is reformulated to use block-level shared-memory aggregation, kernel fusion becomes effective again, achieving speedups of up to 2× by eliminating global synchronization bottlenecks. We further evaluate the effect of fusion on image classification models using PyTorch 2.10.0 JIT, achieving 1.54× to 1.83× faster inference. Our results provide practical guidelines on when kernel fusion is most effective.
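To make the memory-traffic argument concrete, the following is a minimal sketch of an idealized traffic model for the multiply–add case. It is not the authors' benchmark code. The function names, the input size, and the definition of effective bandwidth (bytes read plus bytes written, divided by elapsed time) are assumptions chosen for illustration, and the model ignores caches and kernel launch overhead.

```python
# Idealized global-memory traffic model for out = a * b + c (illustrative only).
# All names and the input size below are hypothetical, not taken from the paper.

def bytes_moved(num_elements, arrays_touched, bytes_per_element=4):
    """Traffic of one kernel that reads or writes each touched array exactly once."""
    return num_elements * arrays_touched * bytes_per_element


def effective_bandwidth_gbs(total_bytes, seconds):
    """Assumed definition: (bytes read + bytes written) / time, in GB/s."""
    return total_bytes / seconds / 1e9


n = 1 << 20  # hypothetical number of FP32 elements

# Unfused: kernel 1 reads a, b and writes tmp; kernel 2 reads tmp, c and writes out.
unfused_bytes = bytes_moved(n, 3) + bytes_moved(n, 3)

# Fused: a single kernel reads a, b, c and writes out; tmp never leaves registers.
fused_bytes = bytes_moved(n, 4)

traffic_ratio = unfused_bytes / fused_bytes
print(f"unfused/fused traffic ratio: {traffic_ratio}")  # prints 1.5
```

Under these assumptions, fusion removes the write and re-read of the intermediate array, so the unfused version moves 1.5 times as many bytes. Measured speedups can differ from this ratio in either direction, since launch overhead, caching, and precision all affect real kernels.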
May 20, 2026 by hgpu