Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU
Department of Computer Science, Stanford University
arXiv:0910.0505 [cs.AR] (14 November 2009)
Graphics processing units (GPUs) are gaining widespread use in computational chemistry and other scientific simulation contexts because of their huge performance advantages relative to conventional CPUs. However, the reliability of GPUs in error-intolerant applications is largely unproven. In particular, a lack of error checking and correcting (ECC) capability in the memory subsystems of graphics cards has been cited as a hindrance to the acceptance of GPUs as high-performance coprocessors, but the impact of this design has not been previously quantified.

In this article we present MemtestG80, our software for assessing memory error rates on NVIDIA G80 and GT200-architecture-based graphics cards. Furthermore, we present the results of a large-scale assessment of GPU error rate, conducted by running MemtestG80 on over 20,000 hosts on the Folding@home distributed computing network. Our control experiments on consumer-grade and dedicated-GPGPU hardware in a controlled environment found no errors. However, our survey over cards on Folding@home finds that, in their installed environments, two-thirds of tested GPUs exhibit a detectable, pattern-sensitive rate of memory soft errors. We demonstrate that these errors persist after controlling for overclocking and environmental proxies for temperature, but depend strongly on board architecture.
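To illustrate what "pattern-sensitive" can mean for a memory test, the following is a minimal host-side sketch in Python. It is not MemtestG80's algorithm and does not touch GPU memory: the list standing in for memory, the helper names (`pattern_test`, `count_bit_errors`, `stuck_at_zero`), and the stuck-at-zero fault model are all assumptions made for illustration. The idea is only that a write-then-read-back test reports an error for some bit patterns and not for others.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def count_bit_errors(memory, expected):
    """Count the bits that differ between each stored word and `expected`."""
    expected &= WORD_MASK
    return sum(bin((word ^ expected) & WORD_MASK).count("1") for word in memory)


def pattern_test(memory, pattern, fault=None):
    """Write `pattern` to every word, optionally apply a fault, then read back.

    `memory` is a plain Python list standing in for a memory region.
    `fault` is a hypothetical callback that corrupts the list in place,
    used here only to simulate a misbehaving cell.
    Returns the number of bit errors observed on read-back.
    """
    for i in range(len(memory)):
        memory[i] = pattern & WORD_MASK
    if fault is not None:
        fault(memory)
    return count_bit_errors(memory, pattern)


def stuck_at_zero(index, bit):
    """Hypothetical fault model: one bit of one word always reads as 0."""
    def fault(memory):
        memory[index] &= ~(1 << bit) & WORD_MASK
    return fault


if __name__ == "__main__":
    mem = [0] * 1024
    bad_cell = stuck_at_zero(index=5, bit=3)
    for pattern in (0x00000000, 0xFFFFFFFF, 0x55555555, 0xAAAAAAAA):
        errors = pattern_test(mem, pattern, bad_cell)
        print(f"pattern 0x{pattern:08X}: {errors} bit error(s)")
```

In this toy model the faulty bit is only detected by patterns that try to store a 1 in it (`0xFFFFFFFF` and `0xAAAAAAAA`), which is why a test suite that cycles through several patterns can observe errors that a single pattern would miss.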
November 9, 2010 by hgpu