high performance computing on graphics processing units: hgpu.org

Iris: First-Class Multi-GPU Programming Experience in Triton

Muhammad Awad, Muhammad Osama, Brandon Potter

Source codes

Tags: AMD Radeon Instinct MI300X, ATI, Benchmarking, Computer science, CUDA, HIP, nVidia, Package, Python, Triton

November 23, 2025 by hgpu

AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs

Pedro Antunes, Ana Rita Ortigoso, Gabriel Vieira, Daniel Fuentes, Luís Frazão, Nuno Costa, António Pereira

Tags: AMD Radeon Pro W6600, ATI, Computer science, CUDA, Heterogeneous systems, LLM, nVidia, nVidia GeForce GTX 1660, nVidia GeForce RTX 3070, ROCm

November 23, 2025 by hgpu

ProofWright: Towards Agentic Formal Verification of CUDA

Bodhisatwa Chatterjee, Drew Zagieboylo, Sana Damani, Siva Hari, Christos Kozyrakis

Tags: Code generation, Computer science, CUDA, LLM, nVidia

November 23, 2025 by hgpu

The Anatomy of a Triton Attention Kernel

Burkhard Ringlein, Jan van Lunteren, Radu Stoica, Thomas Parnell

Tags: AMD Radeon Instinct MI250, AMD Radeon Instinct MI300X, ATI, Computer science, CUDA, DSL, HIP, LLM, nVidia, nVidia H100, Performance, Programming Languages, Triton

November 23, 2025 by hgpu

Inside VOLT: Designing an Open-Source GPU Compiler

Shinnung Jeong, Chihyo Ahn, Huanzhi Pu, Jisheng Zhao, Hyesoon Kim, Blaise Tine

Tags: Code generation, Compilers, Computer science, CUDA, nVidia, OpenCL

November 23, 2025 by hgpu

PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

Kelun Lei, Hailong Yang, Huaitao Zhang, Xin You, Kaige Zhang, Zhongzhi Luan, Yi Liu, Depei Qian

Tags: Code generation, Computer science, CUDA, LLM, nVidia, nVidia A100, Performance

November 16, 2025 by hgpu

HipKittens: Fast and Furious AMD Kernels

William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher Ré, Simran Arora

Source codes

Tags: AMD Radeon Instinct MI355X, ATI, Computer science, nVidia, nVidia B200, Package, Performance

November 16, 2025 by hgpu

An MLIR pipeline for offloading Fortran to FPGAs via OpenMP

Gabriel Rodriguez-Canal, David Katz, Nick Brown

Source codes

Tags: Computer science, Fortran, FPGA, Heterogeneous systems, HLS, HPC, LLVM, OpenMP, Package

November 16, 2025 by hgpu

MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies

Stepan Vanecek, Manuel Walter Mussbacher, Dominik Groessler, Urvij Saroliya, Martin Schulz

Source codes

Tags: AMD Radeon Instinct MI100, AMD Radeon Instinct MI210, AMD Radeon Instinct MI300X, ATI, Benchmarking, Computer science, CUDA, HIP, nVidia, nVidia A100, nVidia GeForce RTX 2080, nVidia H100, nVidia Quadro P 6000, nVidia V100, Package, PTX

November 16, 2025 by hgpu

A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data

Zheng Li, Weiyan Wang, Ruiyuan Li, Chao Chen, Xianlei Long, Linjiang Zheng, Quanqing Xu, Chuanhui Yang

Source codes

Tags: Algorithms, Compression, Computer science, CUDA, Heterogeneous systems, nVidia, nVidia GeForce RTX 5080, Package

November 16, 2025 by hgpu

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding

Source codes

Tags: Code generation, Computer science, CUDA, nVidia, nVidia A100, nVidia GeForce RTX 3090, nVidia GeForce RTX 4090, nVidia RTX 6000 Ada, Package, Performance

November 9, 2025 by hgpu

RDMA Point-to-Point Communication for LLM Systems

Nandor Licker, Kevin Hu, Vladimir Zaytsev, Lequn Chen

Source codes

Tags: Computer science, CUDA, LLM, nVidia, nVidia H200, Package, Performance, RDMA

November 9, 2025 by hgpu

CUDAnalyst (CUDA + Analyst)

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

CodegenBench

CodegenBench: Can LLMs Write Efficient Code Across Architectures?

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Agentic Code Optimization via Compiler-LLM Cooperation

Device Virtual Machine (DVM)

DVM: Real-Time Kernel Generation for Dynamic AI Models

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Recent source codes

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

Agentic Code Optimization via Compiler-LLM Cooperation

Device Virtual Machine (DVM)

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
