The Anatomy of a Triton Attention Kernel
IBM Research, Zurich, Switzerland
arXiv:2511.11581 [cs.LG] (7 Oct 2025)
@misc{ringlein2025anatomytritonattentionkernel,
      title={The Anatomy of a Triton Attention Kernel},
      author={Burkhard Ringlein and Jan van Lunteren and Radu Stoica and Thomas Parnell},
      year={2025},
      eprint={2511.11581},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.11581}
}
A long-standing goal in both industry and academia is to develop an LLM inference platform that is portable across hardware architectures, eliminates the need for low-level hand-tuning, and still delivers best-in-class efficiency. In this work, we demonstrate that portable, efficient cross-platform LLM inference is indeed possible and share our experience doing so. We develop a paged attention kernel, the core performance-critical component of many LLM deployments, built exclusively on the domain-specific, just-in-time-compiled language Triton, and show that it achieves state-of-the-art performance on both NVIDIA and AMD GPUs. We describe our high-level approach, the key algorithmic and system-level improvements, the parameter auto-tuning required to unlock efficiency, and the inference-server integrations necessary to bring the performance of a generic Triton attention kernel from 19.7% of the state of the art to 105.9%. Our results highlight how open-source domain-specific languages can be leveraged to unlock model portability across different GPU vendors.
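
The post carries only the abstract, but two of the ingredients it names, a Triton attention kernel and Triton's parameter auto-tuning, can be made concrete with a short sketch. The code below is not the authors' kernel: it handles a single decode query per (batch, head) program over a contiguous (non-paged) KV cache, and every name, shape, and tuning config is an illustrative assumption.

# A minimal sketch, NOT the paper's kernel: single-query ("decode")
# attention in Triton with a streaming (online) softmax, plus
# @triton.autotune to illustrate the parameter auto-tuning the
# abstract mentions. Layouts and configs are assumptions; the
# paper's kernel additionally supports paged KV caches.
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK_N": bn}, num_warps=w)
        for bn in (64, 128, 256)
        for w in (4, 8)
    ],
    key=["seq_len"],  # re-benchmark configs when seq_len changes
)
@triton.jit
def decode_attn_kernel(
    q_ptr, k_ptr, v_ptr, o_ptr,  # float32 tensors, layouts as in the launcher
    seq_len, sm_scale,
    HEAD_DIM: tl.constexpr,      # head dimension, power of two
    BLOCK_N: tl.constexpr,       # KV tokens per iteration (auto-tuned)
):
    # One program instance handles one flattened (batch * head) row.
    pid = tl.program_id(0)
    d = tl.arange(0, HEAD_DIM)
    q = tl.load(q_ptr + pid * HEAD_DIM + d).to(tl.float32)

    # Running statistics for the numerically stable online softmax.
    m = tl.full([1], -float("inf"), tl.float32)   # running max
    l = tl.zeros([1], dtype=tl.float32)           # running normalizer
    acc = tl.zeros([HEAD_DIM], dtype=tl.float32)  # running weighted sum of V

    for start in range(0, seq_len, BLOCK_N):
        n = start + tl.arange(0, BLOCK_N)
        mask = n < seq_len
        k = tl.load(k_ptr + pid * seq_len * HEAD_DIM
                    + n[:, None] * HEAD_DIM + d[None, :],
                    mask=mask[:, None], other=0.0).to(tl.float32)
        v = tl.load(v_ptr + pid * seq_len * HEAD_DIM
                    + n[:, None] * HEAD_DIM + d[None, :],
                    mask=mask[:, None], other=0.0).to(tl.float32)

        # Scaled attention scores for this KV tile.
        s = tl.sum(q[None, :] * k, axis=1) * sm_scale
        s = tl.where(mask, s, -float("inf"))

        # Online softmax update: rescale old state, fold in the new tile.
        m_new = tl.maximum(m, tl.max(s, axis=0))
        alpha = tl.exp(m - m_new)
        p = tl.exp(s - m_new)
        acc = acc * alpha + tl.sum(p[:, None] * v, axis=0)
        l = l * alpha + tl.sum(p, axis=0)
        m = m_new

    tl.store(o_ptr + pid * HEAD_DIM + d, acc / l)


def decode_attention(q, k, v):
    # q: (H, D); k, v: (H, S, D); contiguous float32 (hypothetical layout).
    H, D = q.shape
    S = k.shape[1]
    o = torch.empty_like(q)
    decode_attn_kernel[(H,)](q, k, v, o, S, D ** -0.5, HEAD_DIM=D)
    return o

Calling decode_attention on (H, D) queries against an (H, S, D) cache computes softmax(qK^T / sqrt(D)) V per head. The streaming update is what lets a real paged attention kernel walk the KV cache block by block without materializing the full score matrix, and @triton.autotune is the standard Triton mechanism behind the kind of parameter auto-tuning the abstract credits as one factor in the 19.7% to 105.9% improvement.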
November 23, 2025 by hgpu