high performance computing on graphics processing units: hgpu.org

Programming

Real FP4 Tensor-Core Code in Pure Rust on a Gaming GPU – with NVIDIA’s Own Compiler

Carter Richardson

Tags: Computer science, CUDA, nVidia, nVidia GeForce RTX 5070 Ti, PTX, Rust

July 13, 2026 by hgpu

Enhancing the Performance Analysis of NCCL GPU Collectives

Jurij Cerar

Tags: Computer science, CUDA, nVidia, nVidia H100, Performance, Tesla T4, Thesis

July 13, 2026 by hgpu

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

Ximing Fan, Yong Fang, Peng Jia, Yang Liu, Yijia Xu, Xi Peng, Yuhao Zhou

Source codes

Tags: Computer science, CUBLAS, CUDA, FFT, LLM, nVidia, Package, Security

July 13, 2026 by hgpu

Augmenting LLM Code Translation with Compiler Analysis for C to Triton Kernel Generation

Xiao Qin, Chunwei Xia, Zheng Wang

Tags: Computer science, CUDA, LLM, nVidia, nVidia GeForce RTX 3090, Triton

July 13, 2026 by hgpu

SpecGen: Accelerating Agentic Kernel Optimization with Speculative Generation

Jihu Guo, Sitian Lu, Tenghui Ma, Wei Gao, Zhisheng Ye, Xingcheng Zhang, Dahua Lin

Tags: Code generation, Computer science, CUDA, nVidia, nVidia H200

June 28, 2026 by hgpu

Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

Jiading Gai, Shuai Zhang, Kaj Bostrom, Jin Huang, Vihang Patil, Haoyang Fang, Bernie Wang, Huzefa Rangwala, George Karypis

Tags: Code generation, Computer science, CUDA, LLM, nVidia, nVidia A100, Triton

June 28, 2026 by hgpu

The Correctness Illusion in LLM-Generated GPU Kernels

Dipankar Sarkar

Tags: Benchmarking, Code generation, Computer science, CUDA, LLM, nVidia, nVidia A10, nVidia A100, nVidia GeForce RTX 3060, nVidia H100, nVidia L40s, Triton

June 28, 2026 by hgpu

daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

Dayuan Fu, Mohan Jiang, Tongyu Wang, Dian Yang, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Li

Tags: Computer science, CUDA, LLM, Triton

June 17, 2026 by hgpu

From Tokens to Regions: CUDA-Sensitive Instruction Tuning for GPU Kernel Generation

Wentao Chen, Jiace Zhu, Xing Zhe Chai, Zeng Qu, Qiaoling Xiao, Liucheng Duan, An Zou

Tags: Computer science, CUDA, LLM, nVidia, nVidia GeForce RTX 3090 Ti

June 17, 2026 by hgpu

Fearless Concurrency on the GPU

Melih Elibol, Jared Roesch, Isaac Gelado, Eric Buehler, Michael Garland

Tags: Computer science, CUBLAS, CUDA, nVidia, nVidia B200, nVidia GeForce RTX 5090, Performance, Python, Rust

June 17, 2026 by hgpu

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Yee Hin Chong, Jiaming Wu, Youhui Zhang, Peng Qu

Source codes

Tags: Computer science, CUDA, Heterogeneous systems, LLM, nVidia, Package, PTX

June 8, 2026 by hgpu

MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang

Tags: Computer science, CUDA, LLM, PyTorch

June 8, 2026 by hgpu

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)