Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation


Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Hao Yu, Yida Wang, Gennady Pekhimenko

University of Toronto, Toronto, ON, Canada

31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’26), 2026

DOI:10.1145/3760250.3762219

@inproceedings{ding2026tilus,
  title={Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation},
  author={Ding, Yaoyao and Hou, Bohan and Zhang, Xiao and Lin, Allan and Chen, Tianqi and Yu, Cody Hao and Wang, Yida and Pekhimenko, Gennady},
  booktitle={Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1},
  pages={281--297},
  year={2026}
}


Package: Tilus: A Tile-Level GPU Kernel Programming Language


Serving Large Language Models (LLMs) is critical for AI-powered applications, yet it demands substantial computational resources, particularly memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance because of high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, that are essential for efficient low-precision computation. In this paper, we introduce Tilus, a domain-specific language designed for General-Purpose GPU (GPGPU) computing that supports low-precision data types with arbitrary bit widths from 1 to 8 while maintaining GPU programmability. Tilus features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. Tilus programs are compiled into highly efficient GPU programs through automatic vectorization and instruction selection. Extensive experiments demonstrate that Tilus efficiently supports a full spectrum of low-precision data types and outperforms state-of-the-art low-precision kernels. Compared to the existing compilers Triton and Ladder and the hand-optimized kernels QuantLLM and Marlin, Tilus achieves performance improvements of 1.75x, 2.61x, 1.29x, and 1.03x, respectively. We open-source Tilus.
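To illustrate why bit widths that are not powers of two are awkward to store, here is a minimal Python sketch that packs unsigned integers of any width from 1 to 8 bits into a contiguous byte string and unpacks them again. It is not Tilus code and does not reflect the layout Tilus actually uses; the function names `pack_bits` and `unpack_bits` and the little-endian bit order are assumptions made for this example only.

```python
def pack_bits(values, bits):
    """Pack unsigned integers of width `bits` (1..8) into a little-endian bitstream.

    Hypothetical helper for illustration; not part of Tilus.
    """
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    mask = (1 << bits) - 1
    acc = 0
    for i, v in enumerate(values):
        if not 0 <= v <= mask:
            raise ValueError(f"value {v} does not fit in {bits} bits")
        # Element i occupies bit positions [i*bits, (i+1)*bits).
        acc |= v << (i * bits)
    nbytes = (len(values) * bits + 7) // 8  # round up to whole bytes
    return acc.to_bytes(nbytes, "little")


def unpack_bits(packed, bits, count):
    """Recover `count` unsigned integers of width `bits` from `packed`."""
    mask = (1 << bits) - 1
    acc = int.from_bytes(packed, "little")
    return [(acc >> (i * bits)) & mask for i in range(count)]


# Eight 3-bit values occupy 24 bits, i.e. 3 bytes instead of 8.
weights = [5, 0, 7, 3, 1, 6, 2, 4]
packed = pack_bits(weights, bits=3)
restored = unpack_bits(packed, bits=3, count=len(weights))
```

With a 3-bit width, individual elements straddle byte boundaries (the third value above spans bytes 0 and 1), so a kernel cannot load one element with a single aligned byte access. This is the kind of detail that makes fine-grained control over registers and memory access patterns matter for such data types.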

Tags: Computer science, CUDA, nVidia, nVidia A100, nVidia H100, nVidia L40s, Package, Programming Languages, PTX, Triton

December 29, 2025 by hgpu

