Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU

Imran S. Haque, Vijay S. Pande
Department of Computer Science, Stanford University
arXiv:0910.0505 [cs.AR] (14 November 2009)


Published in: 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid)





Graphics processing units (GPUs) are gaining widespread use in computational chemistry and other scientific simulation contexts because of their huge performance advantages relative to conventional CPUs. However, the reliability of GPUs in error-intolerant applications is largely unproven. In particular, a lack of error checking and correcting (ECC) capability in the memory subsystems of graphics cards has been cited as a hindrance to the acceptance of GPUs as high-performance coprocessors, but the impact of this design has not been previously quantified. In this article we present MemtestG80, our software for assessing memory error rates on NVIDIA G80 and GT200-architecture-based graphics cards. Furthermore, we present the results of a large-scale assessment of GPU error rate, conducted by running MemtestG80 on over 20,000 hosts on the Folding@home distributed computing network. Our control experiments on consumer-grade and dedicated-GPGPU hardware in a controlled environment found no errors. However, our survey over cards on Folding@home finds that, in their installed environments, two-thirds of tested GPUs exhibit a detectable, pattern-sensitive rate of memory soft errors. We demonstrate that these errors persist after controlling for overclocking and environmental proxies for temperature, but depend strongly on board architecture.