Design and Modeling of a Non-blocking Checkpointing System

Kento Sato, Kathryn Mohror, Adam Moody, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama, Satoshi Matsuoka
Dept. of Mathematical and Computing Science, Tokyo Institute of Technology, 2-12-1-W8-33, Ohokayama, Meguro-ku, Tokyo 152-8552 Japan
Supercomputing 2012 (SC’12), 2012


   title={Design and Modeling of a Non-blocking Checkpointing System},

   author={Sato, K. and Mohror, K. and Moody, A. and Gamblin, T. and de Supinski, B.R. and Maruyama, N. and Matsuoka, S.},






As the capability and component count of systems increase, the mean time between failures (MTBF) decreases. Typically, applications tolerate failures by checkpointing to, and restarting from, a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale-class machines, which are predicted to have memory sizes and failure rates that are orders of magnitude larger. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve efficiency by 1.1x to 2.0x on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
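To make the link between MTBF, checkpoint cost, and efficiency concrete, the sketch below uses Young's classic first-order approximation for a single-level, blocking checkpoint. This is not the model developed in the paper; it ignores restart time and multi-level storage, and the function names and all numbers are hypothetical, chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the checkpoint interval (seconds).

    Assumption: checkpoint_cost_s is much smaller than mtbf_s.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def approx_efficiency(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Rough fraction of wall-clock time spent on useful work.

    Waste has two parts: time spent writing checkpoints (C / tau) and
    expected recomputation after a failure (tau / (2 * MTBF)).
    Restart time is ignored in this simplified sketch.
    """
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / tau + tau / (2.0 * mtbf_s)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    mtbf = 10_000.0  # hypothetical MTBF in seconds
    # Hypothetical visible checkpoint costs: a blocking write to the PFS
    # versus a write whose cost is mostly hidden from the application.
    for label, cost in [("blocking", 200.0), ("mostly hidden", 50.0)]:
        eff = approx_efficiency(cost, mtbf)
        print(f"{label}: cost={cost:.0f}s efficiency={eff:.2f}")
```

Under these assumptions, reducing the checkpoint cost visible to the application raises efficiency, and the benefit grows as the MTBF shrinks, which is the general motivation for hiding PFS writes behind computation.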