Understanding the Landscape of Ampere GPU Memory Errors

Zhu Zhu, Yu Sun, Dhatri Parakal, Bo Fang, Steven Farrell, Gregory H. Bauer, Brett Bode, Ian T. Foster, Michael E. Papka, William Gropp, Zhao Zhang, Lishan Yang

George Mason University, USA

arXiv:2508.03513 [cs.DC], (5 Aug 2025)

DOI:10.48550/arXiv.2508.03513

@misc{zhu2025understandinglandscapeamperegpu,
  title={Understanding the Landscape of Ampere GPU Memory Errors},
  author={Zhu Zhu and Yu Sun and Dhatri Parakal and Bo Fang and Steven Farrell and Gregory H. Bauer and Brett Bode and Ian T. Foster and Michael E. Papka and William Gropp and Zhao Zhang and Lishan Yang},
  year={2025},
  eprint={2508.03513},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2508.03513}
}

Graphics Processing Units (GPUs) have become a de facto solution for accelerating high-performance computing (HPC) applications. Understanding their memory error behavior is an essential step toward achieving efficient and reliable HPC systems. In this work, we present a large-scale cross-supercomputer study to characterize GPU memory reliability, covering three supercomputers – Delta, Polaris, and Perlmutter – all equipped with NVIDIA A100 GPUs. We examine error logs spanning 67.77 million GPU device-hours across 10,693 GPUs. We compare error rates and mean-time-between-errors (MTBE) and highlight both shared and distinct error characteristics among these three systems. Based on these observations and analyses, we discuss the implications and lessons learned, focusing on the reliable operation of supercomputers, the choice of checkpointing interval, and the comparison of reliability characteristics with those of previous-generation GPUs. Our characterization study provides valuable insights into fault-tolerant HPC system design and operation, enabling more efficient execution of HPC applications.
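To make the link between MTBE and checkpointing interval concrete, the sketch below computes an MTBE from total device-hours and an error count, scales it to a multi-GPU job, and derives a checkpoint interval. It uses the well-known Young/Daly first-order approximation (interval ≈ sqrt(2 × checkpoint cost × MTBE)), which is an assumption here; the abstract does not say which checkpointing model the paper uses. The function names, the error count, the job size, and the checkpoint cost are all hypothetical placeholders, not figures from the paper. Only the 67.77 million device-hours comes from the abstract.

```python
import math


def mtbe_hours(device_hours: float, error_count: int) -> float:
    """Mean time between errors, in GPU device-hours per error."""
    if error_count <= 0:
        raise ValueError("error_count must be positive")
    return device_hours / error_count


def job_mtbe_hours(device_mtbe: float, gpus_in_job: int) -> float:
    """Wall-clock MTBE seen by a job, assuming independent, identical GPUs."""
    if gpus_in_job <= 0:
        raise ValueError("gpus_in_job must be positive")
    return device_mtbe / gpus_in_job


def young_daly_interval_hours(checkpoint_cost_hours: float, mtbe: float) -> float:
    """First-order Young/Daly optimum: sqrt(2 * C * M).

    Reasonable only when the checkpoint cost is much smaller than the MTBE.
    """
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbe)


if __name__ == "__main__":
    DEVICE_HOURS = 67.77e6          # from the abstract
    HYPOTHETICAL_ERRORS = 1_000     # placeholder, NOT a result from the paper
    HYPOTHETICAL_JOB_GPUS = 512     # placeholder job size
    HYPOTHETICAL_CKPT_HOURS = 0.05  # placeholder checkpoint cost (3 minutes)

    per_device = mtbe_hours(DEVICE_HOURS, HYPOTHETICAL_ERRORS)
    per_job = job_mtbe_hours(per_device, HYPOTHETICAL_JOB_GPUS)
    interval = young_daly_interval_hours(HYPOTHETICAL_CKPT_HOURS, per_job)
    print(f"per-device MTBE: {per_device:.1f} h")
    print(f"job-level MTBE:  {per_job:.1f} h")
    print(f"checkpoint every {interval:.2f} h")
```

Under these assumptions, a larger job sees proportionally more frequent errors, so its suggested checkpoint interval shrinks with the square root of the GPU count. Substituting measured error counts for a given system would give system-specific intervals.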

Tags: Computer science, HPC, Memory, nVidia, nVidia A100, nVidia A40

August 10, 2025 by hgpu
