high performance computing on graphics processing units: hgpu.org

hgpu.org » nVidia B200

Fearless Concurrency on the GPU

Melih Elibol, Jared Roesch, Isaac Gelado, Eric Buehler, Michael Garland

Tags: Computer science, CUBLAS, CUDA, nVidia, nVidia B200, nVidia GeForce RTX 5090, Performance, Python, Rust

June 17, 2026 by hgpu

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

Taras Sereda, Burak Bartan, Ankita Nayak, Tom St.John, Natalie Serrino, Zain Asgar

Tags: Code generation, Computer science, CUDA, Heterogeneous systems, Intel, Intel Arc B580, nVidia, nVidia B200, PTX, Triton

June 8, 2026 by hgpu

Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures

Aaron Jarmusch, Sunita Chandrasekaran

Tags: AMD, AMD Radeon Instinct MI250X, AMD Radeon Instinct MI300A, Benchmarking, Computer science, CUDA, nVidia, nVidia B200

May 11, 2026 by hgpu

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Divakar Kumar Yadav, Tian Zhao, Deepak Kumar

Tags: Computer science, CUBLAS, CUDA, LLM, nVidia, nVidia B200, nVidia H100, nVidia RTX PRO 6000, Package, Performance, Triton

May 3, 2026 by hgpu

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Jaber Jaber, Osama Jaber

Tags: Computer science, CUDA, Machine learning, nVidia, nVidia B200, nVidia H100, Package, Triton

March 26, 2026 by hgpu

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Edward Lin, Sahil Modi, Siva Kumar Sastry Hari, Qijing Huang, Zhifan Ye, Nestor Qin, Fengzhe Zhou, Yuan Zhang, Jingquan Wang, Sana Damani, Dheeraj Peri, Ouye Xie, Aditya Kane, Moshe Maor, Michael Behar, Triston Cao, Rishabh Mehta, Vartika Singh, Vikram Sharma Mailthody, Terry Chen, Zihao Ye, Hanfeng Chen, Tianqi Chen, Vinod Grover, Wei Chen, Wei Liu, Eric Chung, Luis Ceze, Roger Bringmann, Cyril Zeller, Michael Lightstone, Christos Kozyrakis, Humphrey Shi

Tags: Benchmarking, Computer science, CUDA, nVidia, nVidia B200, Package, Triton

March 22, 2026 by hgpu

Nsight Python: A Python-First Profiling Toolkit for Seamless GPU Kernel Analysis (Tool)

Bastian Hagedorn, Alexander Collins, Tony Mongkolsmai, Vinod Grover

Tags: Computer science, CUDA, nVidia, nVidia B200, Package, Performance, Profiling, Python, Triton

February 2, 2026 by hgpu

Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

Rupanshu Soi, Rohan Yadav, Fredrik Kjolstad, Alex Aiken, Maryam Mehri Dehnavi, Michael Garland, Michael Bauer

Tags: Computer science, CUDA, nVidia, nVidia B200, nVidia H100, Programming Languages

December 29, 2025 by hgpu

Accurate Models of NVIDIA Tensor Cores

Faizan A. Khattak, Mantas Mikaitis

Tags: Computer science, CUDA, Matrix multiplication, nVidia, nVidia B200, nVidia H100, nVidia V100, Package

December 15, 2025 by hgpu

Microbenchmarking NVIDIA’s Blackwell Architecture: An in-depth Architectural Analysis

Aaron Jarmusch, Sunita Chandrasekaran

Tags: Benchmarking, Computer science, CUDA, HPC, Machine learning, nVidia, nVidia B200, nVidia H200, PTX

December 7, 2025 by hgpu

ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

Stuart H. Sul, Simran Arora, Benjamin F. Spector, Christopher Ré

Tags: Computer science, CUDA, Heterogeneous systems, nVidia, nVidia B200, nVidia H100, Package

November 30, 2025 by hgpu

HipKittens: Fast and Furious AMD Kernels

William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher Ré, Simran Arora

Tags: AMD Radeon Instinct MI355X, ATI, Computer science, nVidia, nVidia B200, Package, Performance

November 16, 2025 by hgpu

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

Probe-and-Refine Tuning of Repository Guidance for Coding Agents

CUDAnalyst (CUDA + Analyst)

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

CodegenBench

CodegenBench: Can LLMs Write Efficient Code Across Architectures?

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Agentic Code Optimization via Compiler-LLM Cooperation

HGPU group © 2010-2026 hgpu.org

All rights belong to the respective authors
