Deep Kernel Fusion for Transformers

Zixi Zhang, Zhiwen Mo, Yiren Zhao, Robert Mullins
Imperial College London, London, UK
arXiv:2602.11808 [cs.LG], (12 Feb 2026)

@misc{zhang2026deep,
   title={Deep Kernel Fusion for Transformers},
   author={Zixi Zhang and Zhiwen Mo and Yiren Zhao and Robert Mullins},
   year={2026},
   eprint={2602.11808},
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2602.11808}
}


Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and boosts cache reuse, delivering up to 13.2% speedup on H100 and 9.7% on A100 over SGLang. Integrated with SGLang and paired with a kernel scheduler, DeepFusionKernel delivers consistent acceleration across generation lengths while remaining adaptable to diverse models, inference configurations, and hardware platforms.
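For context, a standard SwiGLU MLP block computes down_proj(silu(gate_proj(x)) * up_proj(x)); during single-token decoding, all three projection matrices must be streamed from HBM each step, which is the memory-bandwidth bottleneck the abstract describes. Below is a minimal PyTorch sketch of this unfused computation for illustration only; the class and parameter names are assumptions, and it does not reproduce the paper's fused kernel.

import torch
import torch.nn.functional as F

class SwiGLUMLP(torch.nn.Module):
    """Unfused SwiGLU MLP block (illustrative sketch, not the paper's kernel).

    During decode (single-token batches), each forward pass reads the gate,
    up, and down projection weights from HBM, so runtime is dominated by
    weight traffic rather than FLOPs -- the traffic a fused kernel reduces.
    """
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = torch.nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = torch.nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = torch.nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Three separate GEMMs; a fused kernel would keep the intermediate
        # activations on-chip and reuse cached weight tiles across them.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Example: one decode step for a Llama-7B-like configuration (assumed sizes).
mlp = SwiGLUMLP(d_model=4096, d_ff=11008)
x = torch.randn(1, 4096)   # single token's hidden state
y = mlp(x)                 # shape: (1, 4096)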
