Matrix Multiplication on GPUs with On-Line Fault Tolerance
IEEE 9th International Symposium on Parallel and Distributed Processing with Applications (ISPA), 2011
@inproceedings{ding2011matrix,
title={Matrix Multiplication on GPUs with On-Line Fault Tolerance},
author={Ding, C. and Karlsson, C. and Liu, H. and Davies, T. and Chen, Z.},
booktitle={Parallel and Distributed Processing with Applications (ISPA), 2011 IEEE 9th International Symposium on},
pages={311--317},
year={2011},
organization={IEEE}
}
Commercial graphics processing units (GPUs) have proven attractive and inexpensive for high-performance scientific applications. However, a recent study conducted through Folding@home found that two-thirds of the GPUs tested exhibit a detectable, pattern-sensitive rate of memory soft errors in GPGPU use. Fault tolerance has been viewed as critical to the effective use of these GPUs. In this paper, we present an on-line method for detecting, locating, and correcting GPU errors that incorporates fault tolerance into matrix multiplication. The main contribution of the paper is to extend traditional algorithm-based fault tolerance (ABFT) from off-line to on-line operation and apply it to matrix multiplication on GPUs. The proposed on-line mechanism detects soft errors in the middle of the computation, so corrupted results can be corrected promptly and better reliability achieved. Experimental results demonstrate that the proposed method is highly efficient.
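The idea of checking checksums in the middle of a multiplication can be illustrated with a small CPU-side sketch. The code below is not the paper's GPU implementation. It is a minimal pure-Python illustration of classic checksum-based ABFT, assuming the product is accumulated as a sequence of rank-1 (outer-product) updates and that at most one data element of C is corrupted between checks. The names `matmul_abft` and `check_and_correct`, the `check_every` parameter, and the `fault` injection hook are all hypothetical and exist only for this example.

```python
def check_and_correct(C, m, p, tol=1e-9):
    """Verify the checksums of the (m+1) x (p+1) matrix C.

    Returns None if they are consistent, or (row, col, error) after
    repairing a single corrupted data element. Assumption: at most one
    data element is wrong; any other pattern raises an exception.
    """
    # A row is suspect if its data elements do not sum to its row checksum.
    bad_rows = [i for i in range(m) if abs(sum(C[i][:p]) - C[i][p]) > tol]
    # A column is suspect if its data elements do not sum to its column checksum.
    bad_cols = [j for j in range(p)
                if abs(sum(C[i][j] for i in range(m)) - C[m][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]          # location of the error
        error = sum(C[i][:p]) - C[i][p]          # magnitude of the error
        C[i][j] -= error                         # correction
        return (i, j, error)
    raise RuntimeError("error pattern not correctable by this sketch")


def matmul_abft(A, B, check_every=1, fault=None):
    """Multiply A (m x n) by B (n x p), checking checksums during the loop.

    fault is a hypothetical test hook (step, i, j, delta) that adds delta
    to C[i][j] right after outer-product step `step`.
    """
    m, n, p = len(A), len(B), len(B[0])
    # Column-checksum version of A: one extra row holding the column sums.
    Ac = [row[:] for row in A]
    Ac.append([sum(A[i][k] for i in range(m)) for k in range(n)])
    # Row-checksum version of B: one extra column holding the row sums.
    Br = [row[:] + [sum(row)] for row in B]
    # Full-checksum result; the checksum relation holds after every update.
    C = [[0] * (p + 1) for _ in range(m + 1)]
    corrections = []
    for k in range(n):
        for i in range(m + 1):                   # rank-1 update with column k
            a = Ac[i][k]
            for j in range(p + 1):
                C[i][j] += a * Br[k][j]
        if fault is not None and fault[0] == k:  # simulated soft error
            _, fi, fj, delta = fault
            C[fi][fj] += delta
        if (k + 1) % check_every == 0 or k == n - 1:
            fix = check_and_correct(C, m, p)
            if fix is not None:
                corrections.append((k,) + fix)
    # Strip the checksum row and column before returning the product.
    return [row[:p] for row in C[:m]], corrections


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, fixes = matmul_abft(A, B, fault=(0, 1, 0, 100))
print(C)      # [[19, 22], [43, 50]]
print(fixes)  # [(0, 1, 0, 100)]
```

Because the checksum relation holds after every rank-1 update, an error can be caught and repaired at the step where it appears instead of only after the full product is computed. The example uses integers so the comparison is exact; with floating-point data the tolerance `tol` would have to account for round-off, and choosing it is a design decision this sketch leaves open.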
August 9, 2011 by hgpu