Decoupled Triton: A Block-Level Decoupled Language for Writing and Exploring Efficient Machine-Learning Kernels
Quinn L. Pham
Department of Computing Science, University of Alberta
University of Alberta, 2025
DOI: 10.7939/83169
@article{pham2025decoupled,
  title={Decoupled Triton: A Block-Level Decoupled Language for Writing and Exploring Efficient Machine-Learning Kernels},
  author={Pham, Quinn L},
  year={2025}
}
Machine-learning (ML) applications rely on high-performance ML kernels to execute tensor operations such as matrix multiplication and softmax. An ML kernel can be decomposed into two components: the algorithm, which defines the tensor operation that produces the output tensor, and the schedule, which defines how that operation is implemented. Because the schedule determines performance factors such as memory access patterns and vectorization, an efficient schedule is a prerequisite for an efficient ML kernel. Unfortunately, finding an efficient schedule for a given kernel is difficult and may require intimate knowledge of the hardware on which the kernel runs. A decoupled language addresses this by representing a program as the combination of a modular algorithm and a modular schedule.

Triton is a high-abstraction language and compiler for parallel programming that is commonly used to write high-performance ML kernels. Triton follows a block-level programming paradigm in which programmers define kernels at the thread-block level. This paradigm lets developers ignore low-level details such as shared-memory management and memory coalescing, which are handled automatically by the Triton compiler. Despite these abstractions, writing efficient Triton kernels by hand remains tedious and error-prone for two reasons: Triton kernels still expose low-level details such as pointer arithmetic, and iterating on a kernel's schedule is difficult because the algorithm and the schedule are tightly coupled. Furthermore, higher-level ML frameworks that generate Triton kernels, such as PyTorch, may not produce efficient schedules and do not allow users to supply their own.

We propose Decoupled Triton (DT), a block-level decoupled domain-specific language (DSL) and compiler for writing parallel tensor kernels. The DT compiler takes a modular algorithm and a modular schedule, both written in the DT DSL, and generates a Triton kernel; DT thus acts as an abstraction layer on top of Triton that decouples the algorithm from the schedule. The block-level programming paradigm, adopted from Triton, keeps scheduling simple and intuitive, while the decoupling of the algorithm from the schedule makes exploring kernel schedules fast and easy. We demonstrate that DT enables developers to rapidly explore schedule spaces and find efficient ML kernels whose performance matches, and in some cases exceeds, that of kernels hand-written by expert Triton developers and of kernels generated by PyTorch, a mature ML framework.
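For context, the sketch below is not taken from the thesis; it is a minimal hand-written Triton vector-add kernel in the style of the official Triton tutorials. It illustrates the block-level paradigm the abstract describes, along with the explicit pointer arithmetic and the schedule choices (BLOCK_SIZE and grid shape) that remain entangled with the algorithm, which is the coupling DT is designed to factor out. The DT DSL itself is not shown, since its syntax is not given in this post.

# A minimal Triton sketch (not from the thesis): each program instance owns one
# block of elements, and the programmer still writes the pointer arithmetic,
# masking, and block-size choice by hand.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                              # which block this instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)    # explicit pointer arithmetic
    mask = offsets < n_elements                              # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # The schedule decisions (BLOCK_SIZE, grid shape) are baked into both the
    # launch site and the kernel body, which is the coupling that DT aims to remove.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out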
December 7, 2025 by hgpu