high performance computing on graphics processing units: hgpu.org

Changxin Ke, Rui Zhang, Shuo Wang, Li Ding, Guangli Li, Yuanbo Wen, Shuoming Zhang, Ruiyuan Xu, Jin Qin, Jiaming Guo, Chenxi Wang, Ling Li, Qi Guo, Yunji Chen

Source codes

Tags: Code generation, Computer science, CUDA, HPC, LLM, Machine learning, nVidia, nVidia A100, nVidia GeForce RTX 4090, Package

July 13, 2025 by hgpu

P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code

Wali Mohammad Abdullah, Azmain Kabir

Tags: Code generation, Computer science, HPC, LLM, OpenMP, Software Engineering

July 6, 2025 by hgpu

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

Joshua H. Davis, Daniel Nichols, Ishan Khillan, Abhinav Bhatele

Source codes

Tags: Benchmarking, Code generation, Computer science, CUDA, LLM, nVidia, nVidia A100, OpenMP, Package

July 6, 2025 by hgpu

WiLLM: An Open Wireless LLM Communication System

Boyi Liu, Yongguang Lu, Jianguo Zhao, Qiang Yang, Wen Wu, Lin Chen, Jagmohan Chauhan, Jun Zhang

Source codes

Tags: Computer science, LLM, Network communications, nVidia, nVidia GeForce RTX 4090, Package

June 29, 2025 by hgpu

Omniwise: Predicting GPU Kernels Performance with LLMs

Zixian Wang, Cole Ramos, Muhammad A. Awad, Keith Lowery

Tags: AMD, AMD Radeon Instinct MI250, AMD Radeon Instinct MI300X, Artificial intelligence, Benchmarking, Computer science, LLM, Neural networks, Performance, ROCm

June 29, 2025 by hgpu

A First Look at Bugs in LLM Inference Engines

Mugeng Liu, Siqi Zhong, Weichen Bi, Yixuan Zhang, Zhiyang Chen, Zhenpeng Chen, Xuanzhe Liu, Yun Ma

Tags: AI, Computer science, LLM, Software Engineering

June 22, 2025 by hgpu

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, An Zou

Tags: Artificial intelligence, Code generation, Computer science, CUDA, LLM, nVidia, nVidia GeForce GTX 1660, nVidia GeForce RTX 3090 Ti

June 15, 2025 by hgpu

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

Jiaqi Lv, Xufeng He, Yanchen Liu, Xu Dai, Yang Hu, Shouyi Yin

Source codes

Tags: AI, Benchmarking, Compilers, Computer science, CUDA, Deep learning, LLM, nVidia, nVidia A100, Package, performance portability

June 15, 2025 by hgpu

Kernel Library for LLM Serving

Compiler and Runtime Systems for Generative AI Models

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

A Performance Portable Matrix Free Dense MTTKRP in GenTen

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

Accelerating cosmological simulations on GPUs: a portable approach using OpenMP

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

ConCuR: Conciseness Makes State-of-the-Art Kernel Generation

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

VibeCodeHPC: An Agent-Based Iterative Prompting Auto-Tuner for HPC Code Generation Using LLMs

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

exa-AMD: An Exascale-Ready Framework for Accelerating the Discovery and Design of Functional Materials

* * *

AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization

ConTraPh: Contrastive Learning for Parallelization and Performance Optimization

Kevin: Multi-Turn RL for Generating CUDA Kernels

Pre-Training LLMs on a budget: A comparison of three optimizers

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Most viewed papers (last 30 days)