Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU

Imran S. Haque, Vijay S. Pande

Department of Computer Science, Stanford University

arXiv:0910.0505 [cs.AR] (14 November 2009)

DOI:10.1109/CCGRID.2010.84

Package: MemtestG80 and MemtestCL: Memory Testers for CUDA- and OpenCL-enabled GPUs

Graphics processing units (GPUs) are gaining widespread use in computational chemistry and other scientific simulation contexts because of their huge performance advantages relative to conventional CPUs. However, the reliability of GPUs in error-intolerant applications is largely unproven. In particular, a lack of error checking and correcting (ECC) capability in the memory subsystems of graphics cards has been cited as a hindrance to the acceptance of GPUs as high-performance coprocessors, but the impact of this design has not been previously quantified.

In this article we present MemtestG80, our software for assessing memory error rates on NVIDIA G80 and GT200-architecture-based graphics cards. Furthermore, we present the results of a large-scale assessment of GPU error rate, conducted by running MemtestG80 on over 20,000 hosts on the Folding@home distributed computing network. Our control experiments on consumer-grade and dedicated-GPGPU hardware in a controlled environment found no errors. However, our survey over cards on Folding@home finds that, in their installed environments, two-thirds of tested GPUs exhibit a detectable, pattern-sensitive rate of memory soft errors. We demonstrate that these errors persist after controlling for overclocking and environmental proxies for temperature, but depend strongly on board architecture.
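To illustrate what a "pattern-sensitive" memory error means in practice, here is a minimal host-side sketch of a write/read-back pattern test. It is not MemtestG80's code and does not touch GPU memory. `FaultyBuffer`, `count_pattern_errors`, and the stuck-bit fault model are hypothetical names and assumptions introduced only for this example. The stuck bit is a deterministic stand-in chosen so the result is reproducible; real soft errors are transient.

```python
MASK32 = 0xFFFFFFFF  # treat every memory word as 32 bits wide


class FaultyBuffer:
    """Hypothetical stand-in for a block of memory (assumption, not a real API).

    Optionally, one word has a single bit that can never be stored as 1.
    """

    def __init__(self, n_words, bad_index=None, stuck_bit=0):
        self._words = [0] * n_words
        self._bad_index = bad_index
        self._stuck_mask = ~(1 << stuck_bit) & MASK32

    def __len__(self):
        return len(self._words)

    def __setitem__(self, i, value):
        if i == self._bad_index:
            value &= self._stuck_mask  # simulate the faulty bit being cleared
        self._words[i] = value & MASK32

    def __getitem__(self, i):
        return self._words[i]


def count_pattern_errors(buf, pattern):
    """Write a pattern, read it back, then repeat with its bitwise complement.

    Returns the number of words that read back differently from what was
    written. Testing both the pattern and its complement drives every bit
    to both 0 and 1.
    """
    errors = 0
    for p in (pattern & MASK32, ~pattern & MASK32):
        for i in range(len(buf)):
            buf[i] = p
        for i in range(len(buf)):
            if buf[i] != p:
                errors += 1
    return errors


healthy = FaultyBuffer(16)
faulty = FaultyBuffer(16, bad_index=5, stuck_bit=3)

healthy_errors = count_pattern_errors(healthy, 0xAAAAAAAA)
faulty_errors = count_pattern_errors(faulty, 0xAAAAAAAA)
```

In this sketch the fault only shows up when the written word has bit 3 set, so a test that wrote all-zero words alone would report no errors. That dependence on the data being written is the sense in which an error is pattern-sensitive, and it is why memory testers typically cycle through several patterns rather than one.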

Tags: Computer science, CUDA, Hardware Architecture, nVidia, nVidia GeForce 8800 GTX, nVidia GeForce 9500 GT, Package, Tesla C870

November 9, 2010 by hgpu
