G-CP: Providing Fault Tolerance on the GPU through Software Checkpointing
The University of Wisconsin-Madison
ECE 753 Project Progress Report Spring 2010
@article{sinclair2010g,
title={G-CP: Providing Fault Tolerance on the GPU through Software Checkpointing},
author={Sinclair, F.L.M.},
year={2010}
}
GPUs have become increasingly popular in recent years, in large part due to their potential to offer a large amount of computational power at low prices. GPU designers have also made GPU pipelines more general purpose and more programmable, which has made GPUs attractive to a wider audience. Because GPUs are now often used in high performance computing and other general purpose application domains where data integrity is important, providing fault tolerance on GPUs is becoming increasingly important. However, pre-Fermi Nvidia GPUs do not provide fault tolerance. In this project, we present G-CP, a mechanism for providing fault tolerance support in GPUs through software checkpointing combined with time and space redundancy. With G-CP, GPU algorithms can periodically checkpoint their work. If a fault occurs, the user can roll back to the last checkpoint and continue executing.
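The checkpoint-and-rollback idea described in the abstract can be illustrated with a small host-side sketch. This is not the G-CP implementation: it is a minimal Python simulation, assuming that a fault is detected by running each step twice and comparing the results (time redundancy), and that the state is a plain list. The names `run_with_checkpoints`, `make_faulty_step`, `checkpoint_every`, and `max_retries` are hypothetical and chosen only for illustration. Space redundancy is not modeled here.

```python
import copy


def run_with_checkpoints(state, step, num_steps, checkpoint_every=4, max_retries=3):
    """Apply `step` num_steps times, rolling back to the last checkpoint on a mismatch.

    Hypothetical sketch: fault detection is duplicate execution plus comparison.
    """
    checkpoint = (0, copy.deepcopy(state))  # (step index, saved state)
    i = 0
    retries = 0
    while i < num_steps:
        # Time redundancy: run the same step twice on independent copies.
        first = step(copy.deepcopy(state), i)
        second = step(copy.deepcopy(state), i)
        if first != second:
            # Results disagree, so assume a transient fault and roll back.
            retries += 1
            if retries > max_retries:
                raise RuntimeError("fault persists after repeated rollbacks")
            i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        state = first
        i += 1
        if i % checkpoint_every == 0:
            # Periodically save a known-good state and reset the retry budget.
            checkpoint = (i, copy.deepcopy(state))
            retries = 0
    return state


def make_faulty_step(fault_on_call=None):
    """Build a step function that corrupts its output on one chosen call (simulated fault)."""
    calls = {"n": 0}

    def step(data, i):
        calls["n"] += 1
        out = [x + i for x in data]
        if calls["n"] == fault_on_call:
            out[0] ^= 1  # simulated single bit flip
        return out

    return step


if __name__ == "__main__":
    clean = run_with_checkpoints([0, 0, 0], make_faulty_step(), 10)
    faulty = run_with_checkpoints([0, 0, 0], make_faulty_step(fault_on_call=11), 10)
    print(clean == faulty)
```

In this sketch the injected fault hits one of the two redundant executions of a step, the comparison fails, and execution resumes from the most recent checkpoint, so the faulty run ends in the same state as the clean run. The checkpoint interval trades rollback cost against checkpointing overhead; the value 4 is arbitrary.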
December 25, 2011 by hgpu