CUDABench: Benchmarking LLMs for Text-to-CUDA Generation
Shanghai Jiao Tong University, Shanghai, China
arXiv:2603.02236 [cs.LG] (13 Feb 2026)
@misc{zhu2026cudabench,
  title={CUDABench: Benchmarking LLMs for Text-to-CUDA Generation},
  author={Jiace Zhu and Wentao Chen and Qi Fan and Zhixing Ren and Junying Wu and Xing Zhe Chai and Chotiwit Rungrueangwutthinon and Yehan Ma and An Zou},
  year={2026},
  eprint={2603.02236},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.02236}
}
Recent studies have demonstrated the potential of Large Language Models (LLMs) for generating GPU kernels. Current benchmarks focus on translating high-level languages into CUDA, overlooking the more general and challenging task of text-to-CUDA generation. Furthermore, given the hardware-specific and performance-critical nature of GPU programming, accurately assessing the performance of LLM-generated GPU programs is nontrivial. In this work, we introduce CUDABench, a comprehensive benchmark designed to evaluate the text-to-CUDA capabilities of LLMs. First, we construct CUDABench-Set, which covers a Breadth-Depth-Difficulty evaluation space spanning diverse application domains, including artificial intelligence, scientific computing, and data analytics. Furthermore, we propose CUDABench-Score and a Generative Verification Pipeline that assess (1) compilation correctness, (2) functional consistency through execution-based verification, and (3) a novel roofline-based metric, Performance-Score. Benchmarking state-of-the-art LLMs reveals insightful findings and challenges for text-to-CUDA, such as a notable mismatch between high compilation success rates and low functional correctness, a lack of domain-specific algorithmic knowledge, and suboptimal utilization of GPU hardware resources. Our benchmark is available.
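To give a sense of what a roofline-based performance metric looks like, here is a minimal sketch. The roofline model bounds attainable throughput by the lesser of a device's peak compute rate and its memory bandwidth times the kernel's arithmetic intensity; a score can then be formed as the ratio of achieved to attainable throughput. The peak figures below (A100-like) and the exact scoring formula are illustrative assumptions, not the paper's actual CUDABench-Score definition.

```python
def roofline_bound(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Attainable GFLOP/s under the roofline model: the kernel is limited
    either by peak compute or by memory bandwidth * arithmetic intensity
    (FLOPs per byte), whichever is smaller."""
    return min(peak_gflops, peak_bandwidth_gbs * arithmetic_intensity)


def performance_score(achieved_gflops, arithmetic_intensity,
                      peak_gflops=19500.0, peak_bandwidth_gbs=1555.0):
    """Ratio of achieved throughput to the roofline bound, clipped to [0, 1].
    Default peaks are assumed A100-like figures, for illustration only."""
    bound = roofline_bound(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs)
    return min(achieved_gflops / bound, 1.0)


# A memory-bound kernel at 0.25 FLOP/byte is limited to
# min(19500, 1555 * 0.25) = 388.75 GFLOP/s; achieving half of that
# yields a score of about 0.5.
print(round(performance_score(194.375, 0.25), 2))
```

A metric of this shape rewards kernels relative to what the hardware could plausibly deliver for their arithmetic intensity, rather than comparing raw runtimes across problems with very different compute/memory balances.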
March 4, 2026 by hgpu