high performance computing on graphics processing units: hgpu.org

hgpu.org » nVidia RTX PRO 6000

CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

Shiyang Li, Zijian Zhang, Guangyan Sun, Yuebo Luo, Winson Chen, Yanzhi Wang, Mingyi Hong, Caiwen Ding

View

Download (PDF)

Tags: Benchmarking, Computer science, CUDA, LLM, nVidia, nVidia RTX PRO 6000

May 20, 2026 by hgpu

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

Shiyang Li, Haoyang Chen, Mattia Fazzini, Caiwen Ding

View

Download (PDF)

Tags: Benchmarking, Computer science, CUDA, LLM, Machine learning, nVidia, nVidia A100, nVidia H200, nVidia RTX PRO 6000

May 20, 2026 by hgpu

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Divakar Kumar Yadav, Tian Zhao, Deepak Kumar

View

Download (PDF)

Source codes

Tags: Computer science, CUBLAS, CUDA, LLM, nVidia, nVidia B200, nVidia H100, nVidia RTX PRO 6000, Package, Performance, Triton

May 3, 2026 by hgpu

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Musa Cim, Burak Topcu, Mahmut Taylan Kandemir

View

Download (PDF)

Tags: AMD, Computer science, LLM, nVidia, nVidia GeForce RTX 5090, nVidia RTX PRO 6000, Precision

March 15, 2026 by hgpu

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding

View

Download (PDF)

Tags: Benchmarking, Code generation, Computer science, CUBLAS, CUDA, LLM, nVidia, nVidia H200, nVidia RTX PRO 6000

March 4, 2026 by hgpu

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Agentic Code Optimization via Compiler-LLM Cooperation

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

DVM: Real-Time Kernel Generation for Dynamic AI Models

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

See all packages

* * *

high performance computing on graphics processing units: hgpu.org

CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Recent source codes

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

Agentic Code Optimization via Compiler-LLM Cooperation

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

Most viewed papers (last 30 days)