Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU

Imran S. Haque, Vijay S. Pande
Department of Computer Science, Stanford University
arXiv:0910.0505 [cs.AR] (14 November 2009)


Published in: Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on

Graphics processing units (GPUs) are gaining widespread use in computational chemistry and other scientific simulation contexts because of their huge performance advantages relative to conventional CPUs. However, the reliability of GPUs in error-intolerant applications is largely unproven. In particular, a lack of error checking and correcting (ECC) capability in the memory subsystems of graphics cards has been cited as a hindrance to the acceptance of GPUs as high-performance coprocessors, but the impact of this design has not been previously quantified.

In this article we present MemtestG80, our software for assessing memory error rates on NVIDIA G80 and GT200-architecture-based graphics cards. Furthermore, we present the results of a large-scale assessment of GPU error rate, conducted by running MemtestG80 on over 20,000 hosts on the Folding@home distributed computing network. Our control experiments on consumer-grade and dedicated-GPGPU hardware in a controlled environment found no errors. However, our survey over cards on Folding@home finds that, in their installed environments, two-thirds of tested GPUs exhibit a detectable, pattern-sensitive rate of memory soft errors. We demonstrate that these errors persist after controlling for overclocking and environmental proxies for temperature, but depend strongly on board architecture.
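
To make "pattern-sensitive" concrete, here is a minimal sketch of a generic write-then-verify memory test, run in plain Python against simulated memory. It is not MemtestG80's code and makes no claim about the tests that tool actually runs; the `SimulatedMemory` class, the stuck-at-0 fault model, and all function names are hypothetical and exist only for illustration.

```python
from typing import Optional

WORD_MASK = 0xFFFFFFFF  # assumption: model memory as 32-bit words


class SimulatedMemory:
    """Hypothetical stand-in for device memory with an optional stuck-at-0 bit."""

    def __init__(self, n_words: int, stuck_word: Optional[int] = None,
                 stuck_bit: int = 0) -> None:
        self.words = [0] * n_words
        self.stuck_word = stuck_word  # address of the faulty word, if any
        self.stuck_bit = stuck_bit    # bit position that can never hold a 1

    def write(self, addr: int, value: int) -> None:
        value &= WORD_MASK
        if addr == self.stuck_word:
            value &= ~(1 << self.stuck_bit)  # the faulty cell drops the 1
        self.words[addr] = value

    def read(self, addr: int) -> int:
        return self.words[addr]


def count_mismatches(mem: SimulatedMemory, n_words: int, pattern: int) -> int:
    """Write one pattern to every word, read it back, count wrong words."""
    pattern &= WORD_MASK
    for addr in range(n_words):
        mem.write(addr, pattern)
    return sum(1 for addr in range(n_words) if mem.read(addr) != pattern)


def pattern_and_complement(mem: SimulatedMemory, n_words: int, pattern: int) -> int:
    """Test with a pattern and then its bitwise complement, so every bit
    is exercised as both 0 and 1."""
    return (count_mismatches(mem, n_words, pattern)
            + count_mismatches(mem, n_words, ~pattern & WORD_MASK))


if __name__ == "__main__":
    faulty = SimulatedMemory(64, stuck_word=5, stuck_bit=3)
    # The same fault is invisible to one pattern and visible to another.
    print(count_mismatches(faulty, 64, 0x00000000))
    print(count_mismatches(faulty, 64, 0xFFFFFFFF))
```

In this toy model the all-zeros pattern never asks the faulty bit to hold a 1, so the fault goes unreported, while the all-ones pattern exposes it. Real soft errors are transient rather than stuck, so a single pass like this could only ever give a lower bound on what a long-running test observes.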