Decoupled Triton: A Block-Level Decoupled Language for Writing and Exploring Efficient Machine-Learning Kernels
Quinn L. Pham
Department of Computing Science, University of Alberta
University of Alberta, 2025
DOI: 10.7939/83169
@article{pham2025decoupled,
  title={Decoupled Triton: A Block-Level Decoupled Language for Writing and Exploring Efficient Machine-Learning Kernels},
  author={Pham, Quinn L},
  year={2025}
}
Machine-learning (ML) applications rely on high-performance ML kernels to execute tensor operations such as matrix multiplication and softmax. An ML kernel can be decomposed into two components: the algorithm, which defines the tensor operation that produces the output tensor, and the schedule, which defines how that operation is implemented. Because the schedule determines performance factors such as memory access patterns and vectorization, an efficient schedule is a prerequisite for an efficient ML kernel. Unfortunately, finding an efficient schedule for a given kernel is difficult and may require intimate knowledge of the hardware on which the kernel runs. A decoupled language addresses this by representing a program as the combination of a modular algorithm and a modular schedule.

Triton is a high-abstraction language and compiler for parallel programming that is commonly used to write high-performance ML kernels. Triton follows a block-level programming paradigm in which programmers define kernels at the thread-block level. This paradigm lets developers ignore low-level details such as shared-memory management and memory coalescing, which are handled automatically by the Triton compiler. Despite these abstractions, writing efficient Triton kernels by hand remains tedious and error-prone for two reasons: Triton kernels still expose low-level details such as pointer arithmetic, and iterating on a kernel's schedule is difficult because the algorithm and the schedule are tightly coupled. Furthermore, higher-level ML frameworks that generate Triton kernels, such as PyTorch, may not produce efficient schedules and do not allow users to supply their own.

We propose Decoupled Triton (DT), a block-level decoupled domain-specific language (DSL) and compiler for writing parallel tensor kernels. The DT compiler takes a modular algorithm and a modular schedule, both written in the DT DSL, and generates a Triton kernel; DT thus acts as an abstraction layer on top of Triton that decouples the algorithm from the schedule. The block-level programming paradigm, adopted from Triton, keeps scheduling simple and intuitive, while the decoupling of the algorithm from the schedule makes exploring kernel schedules fast and easy. We demonstrate that DT enables developers to rapidly explore schedule spaces and find efficient ML kernels whose performance matches, and in some cases exceeds, that of kernels hand-written by expert Triton developers and of kernels generated by PyTorch, a mature ML framework.
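For context, the sketch below is not taken from the thesis; it is a minimal hand-written Triton vector-add kernel in the style of the official Triton tutorials. It illustrates the block-level paradigm the abstract describes, along with the explicit pointer arithmetic and the schedule choices (BLOCK_SIZE and grid shape) that remain entangled with the algorithm, which is the coupling DT is designed to factor out. The DT DSL itself is not shown, since its syntax is not given in this post.

# A minimal Triton sketch (not from the thesis): each program instance owns one
# block of elements, and the programmer still writes the pointer arithmetic,
# masking, and block-size choice by hand.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                              # which block this instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)    # explicit pointer arithmetic
    mask = offsets < n_elements                              # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # The schedule decisions (BLOCK_SIZE, grid shape) are baked into both the
    # launch site and the kernel body, which is the coupling that DT aims to remove.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out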
December 7, 2025 by hgpu