high performance computing on graphics processing units: hgpu.org

Low-overhead diskless checkpoint for hybrid computing systems

Leonardo Bautista Gomez, Akira Nukada, Naoya Maruyama, Franck Cappello, Satoshi Matsuoka

Tokyo Institute of Technology, Tokyo, Japan

International Conference on High Performance Computing (HiPC), 2010

DOI:10.1109/HIPC.2010.5713163

@inproceedings{gomez2011low,
  title={Low-overhead diskless checkpoint for hybrid computing systems},
  author={Gomez, L.B. and Nukada, A. and Maruyama, N. and Cappello, F. and Matsuoka, S.},
  booktitle={High Performance Computing (HiPC), 2010 International Conference on},
  pages={1--10},
  year={2011},
  organization={IEEE}
}

As the size of new supercomputers scales to tens of thousands of sockets, the mean time between failures (MTBF) is decreasing to just several hours, and long executions need some kind of fault tolerance method to survive failures. Checkpoint/Restart is a popular technique used for this purpose, but writing the state of a big scientific application to remote storage will become prohibitively expensive in the near future. Diskless checkpoint was proposed as a solution to avoid the I/O bottleneck of disk-based checkpoint. However, its complex, time-consuming encoding techniques hinder its scalability. At the same time, heterogeneous computing is becoming more and more popular in high performance computing (HPC), with new clusters combining CPUs and graphics processing units (GPUs). However, hybrid applications cannot always use all the resources available on the nodes, leaving some resources idle, such as GPUs or CPU cores. In this work, we propose a hybrid diskless checkpoint (HDC) technique for GPU-accelerated clusters that can checkpoint CPU/GPU applications, does not require spare nodes, and can tolerate up to 50% of process failures with a low, sometimes negligible, checkpoint overhead.
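To make the idea of diskless checkpointing concrete, here is a minimal sketch of its simplest form, XOR parity: each process keeps its checkpoint in memory, and an encoded parity block allows one lost checkpoint to be rebuilt from the survivors. This is an illustration only, not the encoding used in the paper, which the abstract does not specify. Single parity tolerates one failure per group, so tolerating up to 50% of process failures would require a stronger erasure code. The function names and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equal-length byte strings."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all checkpoint blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode_parity(checkpoints):
    """Compute the parity block that would be held in another process's memory."""
    return xor_blocks(checkpoints)


def recover(checkpoints, parity, lost_rank):
    """Rebuild the checkpoint of one failed process from the survivors and the parity."""
    survivors = [c for rank, c in enumerate(checkpoints) if rank != lost_rank]
    return xor_blocks(survivors + [parity])


# Hypothetical in-memory checkpoints of four processes (illustrative data only).
checkpoints = [bytes([rank] * 8) for rank in range(1, 5)]
parity = encode_parity(checkpoints)

# Simulate the loss of process 2 and rebuild its checkpoint without touching disk.
recovered = recover(checkpoints, parity, lost_rank=2)
assert recovered == checkpoints[2]
```

The encoding step is the cost that the abstract describes as a scalability obstacle; in this sketch it is a single pass over all checkpoint data, which is the kind of work that could in principle be offloaded to otherwise idle resources.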

Tags: Computer science, CUDA, Heterogeneous systems, nVidia, Programming techniques

July 20, 2011 by hgpu
