Deep Kernel Fusion for Transformers

Zixi Zhang, Zhiwen Mo, Yiren Zhao, Robert Mullins
Imperial College London, London, UK
arXiv:2602.11808 [cs.LG], (12 Feb 2026)

@misc{zhang2026deep,
   title={Deep Kernel Fusion for Transformers},
   author={Zixi Zhang and Zhiwen Mo and Yiren Zhao and Robert Mullins},
   year={2026},
   eprint={2602.11808},
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2602.11808}
}


Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and boosts cache reuse, delivering up to 13.2% speedup on H100 and 9.7% on A100 over SGLang. Integrated with SGLang and paired with a kernel scheduler, DeepFusionKernel delivers consistent acceleration across generation lengths while remaining adaptable to diverse models, inference configurations, and hardware platforms.
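For context, a standard SwiGLU MLP block computes down_proj(silu(gate_proj(x)) * up_proj(x)); during single-token decoding, all three projection matrices must be streamed from HBM each step, which is the memory-bandwidth bottleneck the abstract describes. Below is a minimal PyTorch sketch of this unfused computation for illustration only; the class and parameter names are assumptions, and it does not reproduce the paper's fused kernel.

import torch
import torch.nn.functional as F

class SwiGLUMLP(torch.nn.Module):
    """Unfused SwiGLU MLP block (illustrative sketch, not the paper's kernel).

    During decode (single-token batches), each forward pass reads the gate,
    up, and down projection weights from HBM, so runtime is dominated by
    weight traffic rather than FLOPs -- the traffic a fused kernel reduces.
    """
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = torch.nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = torch.nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = torch.nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Three separate GEMMs; a fused kernel would keep the intermediate
        # activations on-chip and reuse cached weight tiles across them.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Example: one decode step for a Llama-7B-like configuration (assumed sizes).
mlp = SwiGLUMLP(d_model=4096, d_ff=11008)
x = torch.randn(1, 4096)   # single token's hidden state
y = mlp(x)                 # shape: (1, 4096)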
