High performance dense linear system solver with soft error resilience

Peng Du, Piotr Luszczek, Jack Dongarra
Electrical Engineering and Computer Science Department, University of Tennessee, Knoxville
IEEE International Conference on Cluster Computing (CLUSTER), 2011










As the scale of modern high-end computing systems continues to grow rapidly, system failure has become an issue that requires a better solution than the commonly used scheme of checkpoint and restart (C/R). While hard errors have been studied extensively over the years, soft errors remain under-studied, especially for modern HPC systems, and in some scientific applications C/R is not applicable to soft errors at all because of error propagation and lack of error awareness. In this work, we propose an algorithm-based fault tolerance (ABFT) scheme for a high performance dense linear system solver with soft error resilience. By adapting a mathematical model that treats a soft error during LU factorization as a rank-one perturbation, the solution of Ax=b can be recovered with the Sherman-Morrison formula. Our contributions include extending the error model from Gaussian elimination and pairwise pivoting to LU with partial pivoting, a practical numerical bound for error detection, and a scalable checkpointing algorithm that protects the left factor needed to recover x after a soft error. Experimental results on cluster systems with ScaLAPACK show that the fault tolerance functionality adds little overhead to the linear system solve and scales well on such systems.
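The rank-one recovery idea mentioned in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it assumes the soft error corrupted a single matrix entry and that its location and magnitude are already known (the paper's detection bound and checkpointing are not modeled), and it uses plain Python with a toy dense solver instead of ScaLAPACK. The names `solve`, `sherman_morrison_recover`, `A_hat`, `u`, and `v` are hypothetical. If the intended matrix is A = Â + u vᵀ, where Â is the corrupted matrix that was actually factored, the Sherman-Morrison formula gives x = y - z (vᵀy) / (1 + vᵀz) with Ây = b and Âz = u.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [list(row) + [rhs[i]] for i, row in enumerate(M)]
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k.
        p = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for r in range(k + 1, n):
            f = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= f * aug[k][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def sherman_morrison_recover(A_hat, b, u, v):
    """Return the solution of (A_hat + u v^T) x = b using only solves with A_hat."""
    y = solve(A_hat, b)  # solution of the corrupted system
    z = solve(A_hat, u)  # response of the corrupted system to u
    denom = 1.0 + dot(v, z)  # nonzero whenever A_hat + u v^T is nonsingular
    scale = dot(v, y) / denom
    return [yi - zi * scale for yi, zi in zip(y, z)]


# Intended system A x = b with known solution [1, 2, 3].
A = [[4.0, 1.0, 2.0],
     [1.0, 5.0, 1.0],
     [2.0, 1.0, 6.0]]
x_true = [1.0, 2.0, 3.0]
b = [dot(row, x_true) for row in A]

# Assumed soft error: entry (1, 2) silently changes by delta.
delta = 0.5
A_hat = [list(row) for row in A]
A_hat[1][2] += delta

# A = A_hat + u v^T with u = -delta * e_1 and v = e_2.
u = [0.0, -delta, 0.0]
v = [0.0, 0.0, 1.0]

x_bad = solve(A_hat, b)                          # wrong answer from corrupted data
x_rec = sherman_morrison_recover(A_hat, b, u, v)  # corrected answer
```

In this sketch `solve` refactors the matrix on every call for brevity; a practical solver would reuse the already computed LU factors of the corrupted matrix for both triangular solves, so the correction costs far less than a second factorization.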