
G-CP: Providing Fault Tolerance on the GPU through Software Checkpointing

Felix Loh, Matt Sinclair
The University of Wisconsin-Madison
ECE 753 Project Progress Report Spring 2010

@article{sinclair2010g,

   title={G-CP: Providing Fault Tolerance on the GPU through Software Checkpointing},

   author={Loh, Felix and Sinclair, Matt},

   year={2010}

}


GPUs have become increasingly popular in recent years, in large part because they can offer a large amount of computational power at low prices. GPU designers have also made GPU pipelines more general purpose and more programmable, which has made GPUs attractive to a wider audience. GPUs are now often used in high performance computing and other general purpose application domains where data integrity is important, so providing fault tolerance on GPUs is becoming increasingly important. However, pre-Fermi Nvidia GPUs do not provide fault tolerance. In this project, we present G-CP, a mechanism for providing fault tolerance support in GPUs through software checkpointing combined with time and space redundancy. With G-CP, GPU algorithms can periodically checkpoint their work. If a fault occurs, the user can roll back to the last checkpoint and continue executing from there.
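
The checkpoint-and-rollback idea in the abstract can be illustrated with a small host-side sketch. This is not the G-CP implementation, whose details are not given here. It assumes, purely for illustration, that faults are detected by running each step twice and comparing the results (one simple form of time redundancy), and that state is a plain Python list rather than GPU memory. The names `run_with_checkpoints` and `make_faulty_step` are hypothetical.

```python
import copy


def run_with_checkpoints(state, step, num_steps, checkpoint_every=4, max_rollbacks=3):
    """Run `step` num_steps times with duplicate execution and rollback.

    Assumption (not from the report): a fault is detected when two
    executions of the same step on the same input disagree.
    """
    checkpoint = (0, copy.deepcopy(state))  # (step index, saved state)
    rollbacks = 0
    i = 0
    while i < num_steps:
        # Time redundancy: execute the same step twice on private copies.
        first = step(copy.deepcopy(state), i)
        second = step(copy.deepcopy(state), i)
        if first != second:
            # Mismatch: discard work since the last checkpoint and resume there.
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise RuntimeError("too many faults; giving up")
            i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        state = first
        i += 1
        if i % checkpoint_every == 0:
            # Periodic checkpoint of the agreed-upon state.
            checkpoint = (i, copy.deepcopy(state))
    return state, rollbacks


def make_faulty_step(fault_at_call=None):
    """Build a step function that adds i to every element.

    If fault_at_call is set, that call (1-based) flips the low bit of the
    first element to simulate a transient fault.
    """
    calls = {"n": 0}

    def step(values, i):
        calls["n"] += 1
        out = [v + i for v in values]
        if calls["n"] == fault_at_call:
            out[0] ^= 1  # injected single-bit error
        return out

    return step
```

Calling `run_with_checkpoints([0, 0, 0], make_faulty_step(fault_at_call=11), 10)` injects one fault during step 5, after a checkpoint was taken at step 4. The run rolls back once and still returns the fault-free result. A real GPU version would copy device buffers instead of Python lists, and the checkpoint interval would trade checkpoint overhead against the amount of work lost per rollback.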
