high performance computing on graphics processing units: hgpu.org

Low-overhead diskless checkpoint for hybrid computing systems

Leonardo Bautista Gomez, Akira Nukada, Naoya Maruyama, Franck Cappello, Satoshi Matsuoka

Tokyo Institute of Technology, Tokyo, Japan

International Conference on High Performance Computing (HiPC), 2010

DOI:10.1109/HIPC.2010.5713163

As the size of new supercomputers scales to tens of thousands of sockets, the mean time between failures (MTBF) is decreasing to just several hours, and long executions need some kind of fault tolerance method to survive failures. Checkpoint/Restart is a popular technique used for this purpose, but writing the state of a big scientific application to remote storage will become prohibitively expensive in the near future. Diskless checkpoint was proposed as a solution to avoid the I/O bottleneck of disk-based checkpoint. However, its complex, time-consuming encoding techniques hinder its scalability. At the same time, heterogeneous computing is becoming more and more popular in high performance computing (HPC), with new clusters combining CPUs and graphics processing units (GPUs). However, hybrid applications cannot always use all the resources available on the nodes, leaving some resources idle, such as GPUs or CPU cores. In this work, we propose a hybrid diskless checkpoint (HDC) technique for GPU-accelerated clusters that can checkpoint CPU/GPU applications, does not require spare nodes, and can tolerate up to 50% of process failures with a low, sometimes negligible, checkpoint overhead.
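
To make the idea of diskless checkpointing concrete, here is a minimal sketch in Python, assuming the simplest possible encoding: byte-wise XOR parity over the in-memory checkpoints of a group of processes. This is an illustration only and is not claimed to be the encoding used in the paper, which the abstract does not specify. The function names `xor_parity` and `recover_block` and the sample data are hypothetical.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized checkpoint blocks."""
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all checkpoint blocks must have the same length")
    # XOR the i-th byte of every block together to form the i-th parity byte.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical in-memory checkpoints of four processes (illustrative data only).
checkpoints = [bytes([rank] * 8) for rank in range(1, 5)]
parity = xor_parity(checkpoints)

# Simulate losing one process and rebuilding its state without touching disk.
lost = 2
survivors = checkpoints[:lost] + checkpoints[lost + 1:]
restored = recover_block(survivors, parity)
```

Single XOR parity can rebuild only one lost block per group, so this sketch shows the principle of keeping encoded checkpoints in memory rather than the failure tolerance reported in the abstract.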

Tags: Computer science, CUDA, Heterogeneous systems, nVidia, Programming techniques

July 20, 2011 by hgpu

