high performance computing on graphics processing units: hgpu.org

Applications

Cost-Performance Analysis: A Comparative Study of CPU-Based Serverless and GPU-Based Training Architectures

Amine Barrak, Fabio Petrillo, Fehmi Jaafar

Tags: Computer science, Databases, Machine learning, nVidia, Tesla T4

September 28, 2025 by hgpu

Robust LLM Training Infrastructure at ByteDance

Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang

Tags: AI, Computer science, CUDA, LLM, nVidia, nVidia L20

September 28, 2025 by hgpu

Dato: A Task-Based Programming Model for Dataflow Accelerators

Shihan Fang, Hongzheng Chen, Niansong Zhang, Jiajie Li, Han Meng, Adrian Liu, Zhiru Zhang

Tags: Computer science, FPGA, nVidia, Package, Programming Languages, Python

September 21, 2025 by hgpu

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

Robert Tjarko Lange, Qi Sun, Aaditya Prasad, Maxence Faldor, Yujin Tang, David Ha

Tags: Benchmarking, Computer science, CUDA, Filtering, LLM, nVidia, nVidia H100, Package, Software Engineering

September 21, 2025 by hgpu

Evolution of Kernels: Automated RISC-V Kernel Optimization with Large Language Models

Siyuan Chen, Zhichao Lu, Qingfu Zhang

Tags: Computer science, CUDA, LLM, nVidia, Software Engineering

September 21, 2025 by hgpu

High Performance GPU Implementation of KNN Algorithm: A Review

Pooja Bidye, Pradnya Borkar, Nitin Rakesh

Tags: Computer science, Machine learning, Nearest neighbour, nVidia, Review

September 21, 2025 by hgpu

Towards Calculating HPC CUDA Kernel Performance on Nvidia GPUs

Dumeni Manatschal

Tags: Benchmarking, Computer science, CUDA, nVidia, nVidia GeForce RTX 3080, Performance, PTX, Thesis

September 14, 2025 by hgpu

An HPC Benchmark Survey and Taxonomy for Characterization

Andreas Herten, Olga Pearce, Filipe S. M. Guimarães

Tags: Benchmarking, Computer science, CUDA, Fortran, HIP, HPC, MPI, OpenACC, OpenCL, OpenMP, Package, Performance, ROCm, SYCL

September 14, 2025 by hgpu

Home-made Diffusion Model from Scratch to Hatch

Shih-Ying Yeh

Tags: Computer science, Computer vision, nVidia, nVidia GeForce RTX 5090, Package, Python, PyTorch

September 14, 2025 by hgpu

High Performance Matrix Multiplication

Ethan Davis

Tags: BLAS, Computer science, CUBLAS, CUDA, Linear Algebra, Matrix multiplication, nVidia, OpenMP, Package, Performance, Python, Tesla V100

September 14, 2025 by hgpu

Combining Performance and Productivity: Accelerating the Network Sensing Graph Challenge with GPUs and Commodity Data Science Software

Siddharth Samsi, Dan Campbell, Emanuel Scoullos, Oded Green

Tags: Benchmarking, Computer science, CUDA, HPC, nVidia, nVidia A100, nVidia H100, nVidia H200, Sensing

September 14, 2025 by hgpu

CUDAnalyst (CUDA + Analyst)

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

CodegenBench

CodegenBench: Can LLMs Write Efficient Code Across Architectures?

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Agentic Code Optimization via Compiler-LLM Cooperation

Device Virtual Machine (DVM)

DVM: Real-Time Kernel Generation for Dynamic AI Models

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

HGPU group © 2010-2026 hgpu.org

All rights belong to the respective authors
