Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs
Department of Computer Science, Iowa State University, Ames, IA
arXiv:2410.09172 [math.NA], (11 Oct 2024)
@misc{zahid2024testinggpunumericsfinding,
title={Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs},
author={Anwar Hossain Zahid and Ignacio Laguna and Wei Le},
year={2024},
eprint={2410.09172},
archivePrefix={arXiv},
primaryClass={math.NA},
url={https://arxiv.org/abs/2410.09172}
}
As scientific codes are ported between GPU platforms, continuous testing is required to ensure numerical robustness and to identify numerical differences. Compiler-induced numerical differences occur when the same program, compiled and run on different GPUs, produces different numerical outcomes for the same input. We present a study of compiler-induced numerical differences between NVIDIA and AMD GPUs. Our approach uses Varity to generate thousands of short numerical tests in CUDA and HIP, together with their inputs; we then use differential testing to check whether each program produces a numerical inconsistency when run on these GPUs. We also use the HIPIFY tool to convert CUDA tests into HIP and check whether HIPIFY itself induces numerical inconsistencies. We generated more than 600,000 tests and found subtle numerical differences that come from (1) math library calls, (2) differences in floating-point precision (FP64 versus FP32), and (3) converting code with HIPIFY.
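The differential-testing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' tooling: the function names (`bits64`, `kind`, `compare`, `differential_test`), the verdict strings, and the choice to treat any two NaNs as consistent are all assumptions made for illustration. The sketch assumes each test prints a single FP64 result per platform and that the two lists of results have already been collected.

```python
import math
import struct


def bits64(x: float) -> int:
    """Return the IEEE 754 binary64 bit pattern of x as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def kind(x: float) -> str:
    """Coarse class of a floating-point value."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "inf"
    if x == 0.0:
        return "zero"
    return "number"


def compare(a: float, b: float) -> str:
    """Classify one pair of results produced by two platforms."""
    # Assumption: NaN payloads are ignored, so NaN vs NaN is consistent.
    if kind(a) == "nan" and kind(b) == "nan":
        return "consistent"
    # Bitwise equality also distinguishes +0.0 from -0.0.
    if bits64(a) == bits64(b):
        return "consistent"
    if kind(a) != kind(b):
        return f"class mismatch: {kind(a)} vs {kind(b)}"
    return "value mismatch"


def differential_test(results_a, results_b):
    """Return (index, verdict) for every inconsistent pair of outputs."""
    return [
        (i, verdict)
        for i, (a, b) in enumerate(zip(results_a, results_b))
        if (verdict := compare(a, b)) != "consistent"
    ]


if __name__ == "__main__":
    # Made-up values for illustration only; these are not measured results.
    platform_a = [1.5, float("inf"), float("nan"), 0.1 + 0.2]
    platform_b = [1.5, 1.7976931348623157e308, float("nan"), 0.3]
    for index, verdict in differential_test(platform_a, platform_b):
        print(f"test {index}: {verdict}")
```

Comparing bit patterns rather than using a tolerance is a deliberate choice in this sketch: it flags every difference, however small, and leaves the decision about which differences matter to a later triage step.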
October 20, 2024 by hgpu