high performance computing on graphics processing units: hgpu.org

hgpu.org » LLM

Kevin: Multi-Turn RL for Generating CUDA Kernels

Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, Silas Alberti

Tags: Artificial intelligence, Computer science, CUDA, LLM, Machine learning, nVidia, nVidia H100, nVidia H200, Performance

July 20, 2025 by hgpu

Pre-Training LLMs on a budget: A comparison of three optimizers

Joel Schlotthauer, Christian Kroos, Chris Hinze, Viktor Hangya, Luzian Hahn, Fabian Küch

Download (PDF)

Tags: Artificial intelligence, Computer science, CUDA, LLM, Machine learning, nVidia, nVidia A100

July 20, 2025 by hgpu

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Changxin Ke, Rui Zhang, Shuo Wang, Li Ding, Guangli Li, Yuanbo Wen, Shuoming Zhang, Ruiyuan Xu, Jin Qin, Jiaming Guo, Chenxi Wang, Ling Li, Qi Guo, Yunji Chen

Download (PDF)

Source codes

Tags: Code generation, Computer science, CUDA, HPC, LLM, Machine learning, nVidia, nVidia A100, nVidia GeForce RTX 4090, Package

July 13, 2025 by hgpu

P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code

Wali Mohammad Abdullah, Azmain Kabir

Download (PDF)

Tags: Code generation, Computer science, HPC, LLM, OpenMP, Software Engineering

July 6, 2025 by hgpu

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

Joshua H. Davis, Daniel Nichols, Ishan Khillan, Abhinav Bhatele

Download (PDF)

Source codes

Tags: Benchmarking, Code generation, Computer science, CUDA, LLM, nVidia, nVidia A100, OpenMP, Package

July 6, 2025 by hgpu

WiLLM: An Open Wireless LLM Communication System

Boyi Liu, Yongguang Lu, Jianguo Zhao, Qiang Yang, Wen Wu, Lin Chen, Jagmohan Chauhan, Jun Zhang

Download (PDF)

Source codes

Tags: Computer science, LLM, Network communications, nVidia, nVidia GeForce RTX 4090, Package

June 29, 2025 by hgpu

Omniwise: Predicting GPU Kernels Performance with LLMs

Zixian Wang, Cole Ramos, Muhammad A. Awad, Keith Lowery

Download (PDF)

Tags: AMD, AMD Radeon Instinct MI250, AMD Radeon Instinct MI300X, Artificial intelligence, Benchmarking, Computer science, LLM, Neural networks, Performance, ROCm

June 29, 2025 by hgpu

A First Look at Bugs in LLM Inference Engines

Mugeng Liu, Siqi Zhong, Weichen Bi, Yixuan Zhang, Zhiyang Chen, Zhenpeng Chen, Xuanzhe Liu, Yun Ma

Download (PDF)

Tags: AI, Computer science, LLM, Software Engineering

June 22, 2025 by hgpu

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, An Zou

Download (PDF)

Tags: Artificial intelligence, Code generation, Computer science, CUDA, LLM, nVidia, nVidia GeForce GTX 1660, nVidia GeForce RTX 3090 Ti

June 15, 2025 by hgpu

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

Jiaqi Lv, Xufeng He, Yanchen Liu, Xu Dai, Yang Hu, Shouyi Yin

Download (PDF)

Source codes

Tags: AI, Benchmarking, Compilers, Computer science, CUDA, Deep learning, LLM, nVidia, nVidia A100, Package, performance portability

June 15, 2025 by hgpu

Acceleration as a Service (XaaS) Source Containers

Eiman Alnuaimi

Download (PDF)

Source codes

Tags: Computer science, Heterogeneous systems, HPC, Intel, Intel Data Center GPU Max 1550, LLM, MPI, nVidia, nVidia GH200, nVidia V100, Optimization, Package, performance portability, Thesis

June 8, 2025 by hgpu

MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning

Yong-Cheng Liaw, Shuo-Han Chen

Download (PDF)

Tags: Artificial intelligence, Benchmarking, Computer science, LLM, Memory, nVidia, nVidia H100, nVidia RTX A5000

June 8, 2025 by hgpu

Recent source codes

Specx: Speculative task-based runtime system

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

KISim: Kubernetes Intelligent Scheduling Simulator

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication

exa-AMD: Exascale Accelerated Materials Discovery

Accelerated discovery and design of Fe-Co-Zr magnets with tunable magnetic anisotropy through machine learning and parallel computing

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

No More Shading Languages: Compiling C++ to Vulkan Shaders
