high performance computing on graphics processing units: hgpu.org

Fine-Tuning GPT-5 for GPU Kernel Generation

Ali Tehrani, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, Łukasz Dudziak, Mohamed S. Abdelfattah

Tags: Code generation, Computer science, CUDA, LLM, nVidia, nVidia H100, Triton

February 23, 2026 by hgpu

OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization

Arijit Bhattacharjee, Heng Ping, Son Vu Le, Paul Bogdan, Nesreen K. Ahmed, Ali Jannesari

Tags: Computer science, CUDA, LLM, nVidia, nVidia A100, Performance

February 23, 2026 by hgpu

Improving Code Generation via Small Language Model-as-a-judge

Giuseppe Crupi, Rosalia Tufano, Gabriele Bavota

Tags: Code generation, Computer science, LLM, nVidia, nVidia GeForce RTX 3090

February 16, 2026 by hgpu

Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards

Ryo Mikasa, Shun-ichiro Hayashi, Daichi Mukunoki, Tetsuya Hoshino, Takahiro Katagiri

Tags: Benchmarking, Code generation, Computer science, CUDA, HPC, LLM, Matrix multiplication, nVidia, nVidia H100, OpenMP, Performance

February 16, 2026 by hgpu

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

Haolei Bai, Lingcheng Kong, Xueyi Chen, Jianmian Wang, Zhiqiang Tao, Huan Wang

Source codes

Tags: Code generation, Computer science, CUDA, LLM, nVidia, nVidia A100, Package

February 16, 2026 by hgpu

Deep Kernel Fusion for Transformers

Zixi Zhang, Zhiwen Mo, Yiren Zhao, Robert Mullins

Tags: Computer science, CUDA, LLM, nVidia, nVidia A100, Performance

February 16, 2026 by hgpu

HetCCL: Accelerating LLM Training with Heterogeneous GPUs

Heehoon Kim, Jaehwan Lee, Taejeoung Kim, Jongwon Park, Jinpyo Kim, Pyongwon Suh, Ryan H. Choi, Sangwoo Lee, Jaejin Lee

Tags: AMD, AMD FirePro W7800, Computer science, Deep learning, GPU cluster, Heterogeneous systems, LLM, nVidia, Tesla V100

February 8, 2026 by hgpu

Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, Junxian He

Source codes

Tags: Computer science, CUDA, LLM, nVidia, nVidia H100, Package, Triton

February 8, 2026 by hgpu

Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters

Ruobing Han, Hyesoon Kim

Tags: Compilers, Computer science, CUDA, nVidia, nVidia A100, nVidia V100, Triton

February 8, 2026 by hgpu

Inside VOLT: Designing an Open-Source GPU Compiler (Tool)

Shinnung Jeong, Chihyo Ahn, Huanzhi Pu, Jisheng Zhao, Hyesoon Kim, Blaise Pascal Tine

Source codes

Tags: Compilers, Computer science, CUDA, FPGA, nVidia, OpenCL, Package

February 8, 2026 by hgpu

Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs

Jonathan Knoop, Hendrik Holtmann

Source codes

Tags: Cloud, Computer science, CUDA, LLM, nVidia, nVidia GeForce RTX 5060 Ti, nVidia GeForce RTX 5070 Ti, nVidia GeForce RTX 5090, Package

February 2, 2026 by hgpu

Nsight Python: A Python-First Profiling Toolkit for Seamless GPU Kernel Analysis (Tool)

Bastian Hagedorn, Alexander Collins, Tony Mongkolsmai, Vinod Grover

Source codes

Tags: Computer science, CUDA, nVidia, nVidia B200, Package, Performance, Profiling, Python, Triton

February 2, 2026 by hgpu

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Agentic Code Optimization via Compiler-LLM Cooperation

* * *

Recent source codes

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

Agentic Code Optimization via Compiler-LLM Cooperation

Most viewed papers (last 30 days)