high performance computing on graphics processing units: hgpu.org

Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms

Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, Jeff Hammond, Torsten Hoefler

Tags: Computer science, CUDA, GPU cluster, Network communication, nVidia, nVidia GH200

July 13, 2025 by hgpu

LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters

Kunming Zhang, Hanlong Liao, Guoming Tang

Tags: Computer science, GPU cluster, Heterogeneous systems, Machine learning, nVidia, nVidia A800, nVidia GeForce RTX 4090, nVidia H100, nVidia RTX A6000, nVidia V100

June 22, 2025 by hgpu

FLASH: Fast All-to-All Communication in GPU Clusters

Yiran Lei, Dongjoo Lee, Liangyu Zhao, Daniar Kurniawan, Chanmyeong Kim, Heetaek Jeong, Changsu Kim, Hyeonseong Choi, Liangcheng Yu, Arvind Krishnamurthy, Justine Sherry, Eriko Nurvitadhi

Tags: AMD Radeon Instinct MI300X, ATI, Computer science, GPU cluster, Heterogeneous systems, MPI, nVidia, nVidia A100, nVidia B200, nVidia H100

May 25, 2025 by hgpu

Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Yizhou Luo, Qiang Wang, Shaohuai Shi, Jiaxin Lai, Shuhan Qi, Jiajia Zhang, Xuan Wang

Tags: Computer science, CUDA, Deep learning, GPU cluster, nVidia, nVidia GeForce RTX 2080 Ti, OpenMPI, PyTorch, Task scheduling

August 4, 2024 by hgpu

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak

Tags: Computer science, GPU cluster, Heterogeneous systems, nVidia, nVidia A100, nVidia V100, Tesla T4

June 9, 2024 by hgpu

Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach

Peter Thoman, Philip Salzmann

Source codes

Tags: Benchmarking, Computer science, GPU cluster, HPC, nVidia, nVidia V100, Package, SYCL

April 14, 2024 by hgpu

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

Shiwei Zhang, Lansong Diao, Chuan Wu, Zongyan Cao, Siyu Wang, Wei Lin

Source codes

Tags: Computer science, CUDA, Deep learning, Distributed computing, GPU cluster, nVidia, nVidia A100, nVidia P100, nVidia V100, Package, PyTorch

January 14, 2024 by hgpu

gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

Tags: Compression, Computer science, GPU cluster, MPI, nVidia, nVidia A100

August 13, 2023 by hgpu

SYnergy: Fine-grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving

Kaijie Fan, Marco D'Antonio, Lorenzo Carpentieri, Biagio Cosenza, Federico Ficarelli, Daniele Cesarini

Tags: AMD Radeon Instinct MI100, ATI, Computer science, Energy-efficient computing, GPU cluster, Heterogeneous systems, Machine learning, nVidia, nVidia V100, SYCL

August 13, 2023 by hgpu

Isolated Scheduling for Distributed Training Tasks in GPU Clusters

Xinchi Han, Weihao Jiang, Peirui Cao, Qinwei Yang, Yunzhuo Liu, Shuyao Qi, Shengkai Lin, Shizhen Zhao

Source codes

Tags: Computer science, GPU cluster, Machine learning, Neural networks, nVidia, nVidia V100, Package

August 13, 2023 by hgpu

Communication-minimizing Asynchronous Tensor Parallelism

Siddharth Singh, Zack Sating, Abhinav Bhatele

Tags: Computer science, CUDA, GPU cluster, Neural networks, nVidia, nVidia A100

May 28, 2023 by hgpu

Recent source codes

Luthier: Bridging Auto-Tuning and Vendor Libraries for Efficient Deep Learning Inference

Fused Kernel Library (FKL)

GPUHammer: Rowhammer Attacks on GPU Memories are Practical

Block: Balance Loader of LLM Serving with Context, Knowledge and Predictive Scheduling

SIGMo: Scalable Isomorphism Graph Matching on GPUs

DGEMM without FP64 Arithmetic - using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

GEAK-agent: LLM-based AI agent, which can write correct and efficient GPU kernels automatically

OpenDwarfs 2025: re-engineered version of the OpenDwarfs benchmark suite, for compatibility with modern platforms

Specx: Speculative task-based runtime system

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
