high performance computing on graphics processing units: hgpu.org

Design and Modeling of a Non-blocking Checkpointing System

Kento Sato, Kathryn Mohror, Adam Moody, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama, Satoshi Matsuoka

Dept. of Mathematical and Computing Science, Tokyo Institute of Technology, 2-12-1-W8-33, Ohokayama, Meguro-ku, Tokyo 152-8552, Japan

Supercomputing 2012 (SC’12), 2012

@article{sato2012design,
  title={Design and Modeling of a Non-blocking Checkpointing System},
  author={Sato, K. and Mohror, K. and Moody, A. and Gamblin, T. and de Supinski, B.R. and Maruyama, N. and Matsuoka, S.},
  year={2012}
}

As the capability and component count of systems increase, the mean time between failures (MTBF) decreases. Typically, applications tolerate failures by checkpointing to, and restarting from, a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale-class machines, which are predicted to have orders of magnitude larger memory sizes and failure rates. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve efficiency by 1.1x to 2.0x on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
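
To make the relationship between MTBF, checkpoint cost, and efficiency concrete, here is a minimal sketch based on Young's classic first-order approximation for blocking checkpointing. This is a textbook model, not the performance model presented in the paper; the function names and the numbers in the example are illustrative assumptions. It ignores restart time and assumes the checkpoint cost is small relative to the MTBF.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal compute time between checkpoints, in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_efficiency(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time spent on useful work (textbook model).

    Wasted fraction ~= C / T  (time spent writing checkpoints)
                     + T / 2M (expected work lost per failure, on average
                               half an interval, once every M seconds)
    Restart cost is ignored; this is an assumption of the sketch.
    """
    interval_s = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    # Hypothetical values: a 50 s checkpoint with progressively shorter MTBF.
    for mtbf_s in (10_000.0, 2_500.0, 625.0):
        eff = first_order_efficiency(50.0, mtbf_s)
        print(f"MTBF={mtbf_s:>8.0f} s  efficiency={eff:.2f}")
```

Under this simplified model, efficiency drops as the MTBF shrinks unless the effective checkpoint cost shrinks with it, which is the pressure that motivates writing checkpoints to faster storage levels and overlapping the slow PFS write with computation.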

Tags: Computer science, Fault tolerance, GPU cluster, nVidia, Tesla M2050

September 11, 2012 by hgpu
