Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU
Department of Computer Science, Stanford University
arXiv:0910.0505 [cs.AR] (14 November 2009)
@conference{haque2010hard,
title={Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU},
author={Haque, I.S. and Pande, V.S.},
booktitle={Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on},
pages={691--696},
year={2010},
organization={IEEE}
}
Graphics processing units (GPUs) are gaining widespread use in computational chemistry and other scientific simulation contexts because of their huge performance advantages relative to conventional CPUs. However, the reliability of GPUs in error-intolerant applications is largely unproven. In particular, a lack of error checking and correcting (ECC) capability in the memory subsystems of graphics cards has been cited as a hindrance to the acceptance of GPUs as high-performance coprocessors, but the impact of this design has not been previously quantified. In this article we present MemtestG80, our software for assessing memory error rates on NVIDIA G80 and GT200-architecture-based graphics cards. Furthermore, we present the results of a large-scale assessment of GPU error rate, conducted by running MemtestG80 on over 20,000 hosts on the Folding@home distributed computing network. Our control experiments on consumer-grade and dedicated-GPGPU hardware in a controlled environment found no errors. However, our survey over cards on Folding@home finds that, in their installed environments, two-thirds of tested GPUs exhibit a detectable, pattern-sensitive rate of memory soft errors. We demonstrate that these errors persist after controlling for overclocking and environmental proxies for temperature, but depend strongly on board architecture.
November 9, 2010 by hgpu