TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization
University of California, Riverside
arXiv:2512.09196 [cs.SE], 9 Dec 2025
@misc{li2025tritonforgeprofilingguidedframeworkautomated,
  title={TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization},
  author={Haonan Li and Keyu Man and Partha Kanuparthy and Hanning Chen and Wei Sun and Sreen Tallam and Chenguang Zhu and Kevin Zhu and Zhiyun Qian},
  year={2025},
  eprint={2512.09196},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2512.09196}
}
High-performance GPU kernel optimization remains a critical yet labor-intensive task in modern machine learning workloads. Although Triton, a domain-specific language for GPU programming, enables developers to write efficient kernels with concise code, achieving expert-level performance still requires deep understanding of GPU architectures and low-level performance trade-offs. We present TritonForge, a profiling-guided framework for automated Triton kernel optimization. TritonForge integrates kernel analysis, runtime profiling, and iterative code transformation to streamline the optimization process. By incorporating data-driven feedback from profiling results, the system identifies performance bottlenecks, proposes targeted code modifications, and evaluates their impact automatically. While our prototype leverages large language models (LLMs) to assist in code reasoning and transformation, the framework remains modular and model-agnostic. Across diverse kernel types and GPU architectures, TritonForge achieves up to 5x performance improvement over baseline implementations, with an average speedup of 1.76x across successful cases, providing a foundation for future research in automated GPU performance optimization.
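The abstract describes a profile-transform-evaluate loop: measure a kernel, propose a targeted modification, benchmark the candidate, and keep it only if it is faster and still correct. Below is a minimal, hypothetical Python/Triton sketch of that loop, not TritonForge's actual implementation: the paper derives code rewrites from profiling feedback with LLM assistance, whereas this toy version only sweeps the block size of a vector-add kernel. The helper names (run_variant, tune) are invented for illustration.

import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Simple elementwise add; the "optimization target" in this sketch.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)


def run_variant(block_size, x, y, out):
    # One "code variant": same kernel, different launch configuration.
    n = x.numel()
    grid = (triton.cdiv(n, block_size),)
    add_kernel[grid](x, y, out, n, BLOCK=block_size)


def tune(x, y, out, candidates=(128, 256, 512, 1024)):
    # Stand-in for the profile -> transform -> evaluate loop: benchmark each
    # candidate and keep the fastest. TritonForge instead proposes targeted
    # source-level edits guided by profiling data.
    best_cfg, best_ms = None, float("inf")
    for block in candidates:
        ms = triton.testing.do_bench(lambda: run_variant(block, x, y, out))
        if ms < best_ms:
            best_cfg, best_ms = block, ms
    return best_cfg, best_ms


if __name__ == "__main__":
    n = 1 << 20
    x = torch.randn(n, device="cuda")
    y = torch.randn(n, device="cuda")
    out = torch.empty_like(x)
    cfg, ms = tune(x, y, out)
    torch.testing.assert_close(out, x + y)  # reject variants that break correctness
    print(f"best BLOCK={cfg}, {ms:.3f} ms")

In the full framework, the candidate-generation step is driven by profiling results (e.g., memory- versus compute-bound signals) rather than an exhaustive parameter sweep, which is what makes the search tractable for more complex kernels.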
December 15, 2025 by hgpu