gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
University of California, Riverside, Riverside, United States of America
arXiv:2308.05199 [cs.DC] (9 Aug 2023)
@misc{huang2023gzccl,
  title={gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters},
  author={Jiajun Huang and Sheng Di and Xiaodong Yu and Yujia Zhai and Jinyang Liu and Yafan Huang and Ken Raffenetti and Hui Zhou and Kai Zhao and Zizhong Chen and Franck Cappello and Yanfei Guo and Rajeev Thakur},
  year={2023},
  eprint={2308.05199},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}
GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, existing approaches integrate lossy compression directly into GPU-aware collectives, but they still suffer from serious problems such as underutilized GPU devices and uncontrolled data distortion. In this paper, we propose gZCCL, a general framework for designing and optimizing GPU-aware, compression-enabled collectives, with an accuracy-aware design that controls error propagation. To validate our framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, covering both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL and Cray MPI by up to 4.5X and 28.7X, respectively. Furthermore, an accuracy evaluation with an image-stacking application confirms the high quality of the data reconstructed by our accuracy-aware framework.
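To make the underlying idea concrete, the sketch below shows the generic compress-communicate-decompress pattern that compression-enabled collectives build on, using a plain MPI_Scatterv for the data movement. This is an illustration only and is not the gZCCL implementation or API: the gpu_compress/gpu_decompress hooks are hypothetical stand-ins for an error-bounded GPU lossy compressor, and a CUDA-aware MPI is assumed so that device pointers can be passed to MPI calls directly.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Hypothetical GPU compressor hooks (assumptions, not a real library API);
 * they stand in for an error-bounded GPU lossy compressor. */
size_t gpu_compress(const float *dev_in, size_t n, double err_bound,
                    unsigned char *dev_out);
void   gpu_decompress(const unsigned char *dev_in, size_t nbytes,
                      float *dev_out, size_t n);

/* Compression-enabled Scatter (illustrative only): the root compresses each
 * rank's chunk on its GPU, scatters only the compressed bytes, and each rank
 * decompresses back to floats on its own GPU. */
void compressed_scatter(const float *dev_sendbuf, size_t chunk_elems,
                        float *dev_recvbuf, double err_bound,
                        int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    unsigned char *dev_cbuf = NULL;
    int *counts = NULL, *displs = NULL;

    if (rank == root) {
        counts = (int *)malloc(size * sizeof(int));
        displs = (int *)malloc(size * sizeof(int));
        /* Worst case: each compressed chunk is no larger than the raw chunk. */
        cudaMalloc((void **)&dev_cbuf, size * chunk_elems * sizeof(float));
        size_t off = 0;
        for (int r = 0; r < size; ++r) {
            size_t nbytes = gpu_compress(dev_sendbuf + r * chunk_elems,
                                         chunk_elems, err_bound,
                                         dev_cbuf + off);
            counts[r] = (int)nbytes;
            displs[r] = (int)off;
            off += nbytes;
        }
    }

    /* Tell each rank how many compressed bytes it will receive. */
    int my_count = 0;
    MPI_Scatter(counts, 1, MPI_INT, &my_count, 1, MPI_INT, root, comm);

    /* Move only the compressed bytes (CUDA-aware MPI assumed). */
    unsigned char *dev_mybuf = NULL;
    cudaMalloc((void **)&dev_mybuf, my_count);
    MPI_Scatterv(dev_cbuf, counts, displs, MPI_BYTE,
                 dev_mybuf, my_count, MPI_BYTE, root, comm);

    /* Reconstruct the local chunk on the GPU within the chosen error bound. */
    gpu_decompress(dev_mybuf, (size_t)my_count, dev_recvbuf, chunk_elems);

    cudaFree(dev_mybuf);
    if (rank == root) { cudaFree(dev_cbuf); free(counts); free(displs); }
}

A naive pipeline like this leaves the GPU idle during communication and, in multi-step collectives such as Allreduce, lets compression error accumulate across steps; gZCCL's contribution, per the abstract, is to optimize such compression-enabled collectives for GPU utilization and to bound that error propagation with its accuracy-aware design.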
August 13, 2023 by hgpu