Software Reliability Enhancements for GPU Applications
School of Electrical and Computer Engineering, Georgia Institute of Technology, USA
Sixth Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG-2013), held in conjunction with the 8th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), 2013
@article{li2013software,
title={Software Reliability Enhancements for GPU Applications},
author={Li, S. and Farooqui, N. and Yalamanchili, S.},
year={2013}
}
As the role of highly parallel accelerators becomes more important in high-performance computing, so does the need to ensure their reliable operation. In applications where precision and correctness are necessary, bit-level reliable operation is required. While mechanisms for error detection and correction exist, their cost-effective implementation in massively parallel accelerators is still an active area of research. In this paper we present an alternative, software-based approach for improving the reliability of massively parallel bulk-synchronous processors such as modern GPUs. Specifically, we propose a set of software reliability enhancements applied via transparent code patching of GPU applications. The enhancements can be applied selectively at runtime, can be customized by the user, and are transparent to the application. Runtime overhead ranges from 1% to 737%, depending on the nature of the enhancement. We provide an analysis of benefits and limitations.
February 2, 2013 by hgpu