

Matrix Multiplication on GPUs with On-Line Fault Tolerance

Chong Ding, Christer Karlsson, Hui Liu, Teresa Davies, Zizhong Chen

IEEE 9th International Symposium on Parallel and Distributed Processing with Applications (ISPA), 2011

DOI:10.1109/ISPA.2011.50

@inproceedings{ding2011matrix,
  title={Matrix Multiplication on GPUs with On-Line Fault Tolerance},
  author={Ding, C. and Karlsson, C. and Liu, H. and Davies, T. and Chen, Z.},
  booktitle={Parallel and Distributed Processing with Applications (ISPA), 2011 IEEE 9th International Symposium on},
  pages={311--317},
  year={2011},
  organization={IEEE}
}


Commercial graphics processing units (GPUs) have proven attractive and inexpensive for high-performance scientific applications. However, a recent study conducted through Folding@home found that two-thirds of the tested GPUs exhibit a detectable, pattern-sensitive rate of memory soft errors in GPGPU workloads. Fault tolerance is therefore viewed as critical to the effective use of these GPUs. In this paper, we present an on-line GPU error detection, location, and correction method that incorporates fault tolerance into matrix multiplication. The main contribution is to extend traditional algorithm-based fault tolerance (ABFT) from offline to online operation and apply it to matrix multiplication on GPUs. The proposed on-line mechanism detects soft errors in the middle of the computation, so corrupted results can be corrected promptly and reliability improves. Experimental results demonstrate that the proposed method is highly efficient.
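To make the idea of checksum-based detection, location, and correction concrete, here is a minimal CPU-side sketch in plain Python. It is not the authors' GPU implementation. It assumes the classic ABFT encoding (a column-checksum row appended to A, a row-checksum column appended to B) and an outer-product formulation of the multiplication, so the checksum relation can be verified after intermediate steps rather than only at the end. The function names, the `check_every` parameter, and the `fault` tuple used to simulate a soft error are all hypothetical.

```python
def encode_columns(a):
    """Append a row holding the column sums of a (column-checksum matrix)."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def encode_rows(b):
    """Append a column holding the row sums of b (row-checksum matrix)."""
    return [row[:] + [sum(row)] for row in b]


def find_mismatches(c, tol=1e-9):
    """Return indices of data rows and data columns whose checksums disagree."""
    m, n = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(m) if abs(sum(c[i][:n]) - c[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(c[i][j] for i in range(m)) - c[m][j]) > tol]
    return bad_rows, bad_cols


def abft_matmul(a, b, check_every=1, fault=None, tol=1e-9):
    """Multiply a (m x k) by b (k x n), verifying checksums during the run.

    fault is a hypothetical (step, i, j, delta) tuple that corrupts one
    accumulator element after the given step, simulating a soft error.
    Returns (product, list of (step, i, j) corrections applied).
    """
    ac, br = encode_columns(a), encode_rows(b)
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * (n + 1) for _ in range(m + 1)]
    corrections = []
    for s in range(k):
        # Rank-1 (outer-product) update; the checksum relation holds
        # for the partial result after every such step.
        for i in range(m + 1):
            for j in range(n + 1):
                c[i][j] += ac[i][s] * br[s][j]
        if fault is not None and fault[0] == s:
            c[fault[1]][fault[2]] += fault[3]  # injected soft error
        if (s + 1) % check_every == 0 or s == k - 1:
            bad_rows, bad_cols = find_mismatches(c, tol)
            if len(bad_rows) == 1 and len(bad_cols) == 1:
                i, j = bad_rows[0], bad_cols[0]
                # Rebuild the corrupted element from its row checksum.
                c[i][j] = c[i][n] - sum(c[i][q] for q in range(n) if q != j)
                corrections.append((s, i, j))
            elif bad_rows or bad_cols:
                raise RuntimeError("error pattern not correctable by this sketch")
    # Strip the checksum row and column before returning the product.
    return [row[:n] for row in c[:m]], corrections
```

Calling `abft_matmul(a, b, fault=(0, 1, 0, 100))` on two small integer matrices corrupts one partial result after the first step; the intersecting row and column checksum mismatches locate the element, and it is repaired before the next update. The sketch only handles a single corrupted data element per check. Faults in the checksum entries themselves, or several faults between checks, raise an error. With floating-point data, `tol` would need to reflect the expected rounding error.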

Tags: Algorithms, Computer science, Linear Algebra, Matrix multiplication

August 9, 2011 by hgpu
