A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors
University of Virginia
In GH ’07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware (2007), pp. 55-64.
@conference{sheaffer2007hardware,
title={A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors},
author={Sheaffer, J.W. and Luebke, D.P. and Skadron, K.},
booktitle={Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware},
pages={55--64},
year={2007},
organization={Eurographics Association}
}
General purpose computation on graphics processors (GPGPU) has rapidly evolved since the introduction of commodity programmable graphics hardware. With the appearance of GPGPU computation-oriented APIs such as AMD’s Close to the Metal (CTM) and NVIDIA’s Compute Unified Device Architecture (CUDA), we begin to see GPU vendors putting financial stakes into this non-graphics, one-time niche market. Major supercomputing installations are building GPGPU clusters to take advantage of massively parallel floating point capabilities, and Folding@Home has even released a GPU port of its protein folding distributed computation client. But in order for GPGPU to truly become important to the supercomputing community, vendors will have to address the heretofore unimportant reliability concerns of graphics processors. We present a hardware redundancy-based approach to reliability for general purpose computation on GPUs that requires minimal change to existing GPU architectures. Upon detecting an error, the system invokes an automatic recovery mechanism that only recomputes erroneous results. Our results show that our technique imposes less than a 1.5x performance penalty and saves energy for GPGPU but is completely transparent to general graphics and does not affect the performance of the games that drive the market.
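The detect-and-recompute idea in the abstract can be illustrated with a small software analogy. This is only a sketch, not the hardware mechanism the paper proposes: it assumes each element is computed twice, treats a mismatch between the two copies as a detected error, and recomputes only that element. The names `redundant_map` and `FlakyUnit`, the retry limit, and the fault-injection scheme are all hypothetical and exist purely for illustration.

```python
class FlakyUnit:
    """Hypothetical execution unit that corrupts the result of chosen calls."""

    def __init__(self, kernel, faulty_calls):
        self.kernel = kernel
        self.faulty_calls = set(faulty_calls)  # 1-based call numbers to corrupt
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        y = self.kernel(x)
        # Simulated transient fault: perturb the result on the chosen calls.
        return y + 1 if self.calls in self.faulty_calls else y


def redundant_map(execute, inputs, max_retries=3):
    """Compute each element twice; recompute only elements whose copies disagree.

    Returns the agreed results and the indices that needed recomputation.
    """
    results = []
    recomputed = []
    for i, x in enumerate(inputs):
        a, b = execute(x), execute(x)  # redundant pair
        attempts = 0
        while a != b:  # mismatch is treated as a detected error
            if attempts >= max_retries:
                raise RuntimeError(f"element {i} did not converge")
            attempts += 1
            a, b = execute(x), execute(x)  # redo this element only
        if attempts:
            recomputed.append(i)
        results.append(a)
    return results, recomputed


if __name__ == "__main__":
    unit = FlakyUnit(lambda x: x * x, faulty_calls={3})
    print(redundant_map(unit, [1, 2, 3]))
```

In this sketch only the element whose two copies disagree is run again, so the cost of recovery scales with the number of detected errors, not with the size of the input. Duplication alone cannot catch a fault that corrupts both copies identically.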
October 30, 2010 by hgpu