CheCUDA: A Checkpoint/Restart Tool for CUDA Applications


Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu,

Graduate School of Information Sciences, Tohoku University, 6-3 Aramaki-aza-aoba, Sendai, 980-8578 Japan

International Conference on Parallel and Distributed Computing, Applications and Technologies, 2009, p.408-413

DOI:10.1109/PDCAT.2009.78

This paper presents CheCUDA, a tool for checkpointing CUDA applications that use GPUs as accelerators. Because existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks a subset of the basic CUDA driver API calls to record GPU status changes in main memory. At checkpoint time, CheCUDA copies all necessary data from video memory to main memory, disables the CUDA runtime, and then stores the recorded status changes in a file. On restart, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on the GPU so that execution resumes from the stored status. The paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with the basic APIs. The result also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only for enhancing the dependability of CUDA applications but also for enabling the dynamic task scheduling that heterogeneous GPU cluster systems in particular require. The paper also reports the timing overhead of checkpointing.
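The record-and-restore flow described in the abstract can be illustrated with a small sketch. It is not the CheCUDA implementation: the GPU is replaced by a mock device held in host memory so the example runs without CUDA, and every name in it (`MockDevice`, `HookedRuntime`, `Snapshot`, `checkpoint`, `restart`) is a hypothetical stand-in chosen for illustration. The point it tries to show is that the application only ever sees stable handles issued by the hooking layer, so the underlying device resources can be torn down and recreated without the application noticing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <stdexcept>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Stand-in for the GPU (assumption): "video memory" is a set of host
// buffers keyed by a raw handle that changes every time the device is recreated.
struct MockDevice {
    std::map<int, Bytes> memory;
    int next_handle = 100;
    int alloc(std::size_t size) {
        int h = next_handle++;
        memory[h] = Bytes(size, 0);
        return h;
    }
};

// What a checkpoint file would hold in this sketch: buffer contents keyed
// by the stable handle that the application sees.
using Snapshot = std::map<int, Bytes>;

// Hypothetical hooking layer: intercepts allocation and copy calls and
// records the mapping from application handles to raw device handles.
class HookedRuntime {
public:
    HookedRuntime() : device_(std::make_unique<MockDevice>()) {}

    // Hooked allocation: the caller receives a stable handle, not the raw one.
    int alloc(std::size_t size) {
        int app_handle = next_app_handle_++;
        raw_handle_[app_handle] = buffer_alloc(size);
        return app_handle;
    }

    // Host-to-device copy through the stable handle.
    void write(int app_handle, const Bytes& data) {
        Bytes& dst = buffer(app_handle);
        if (data.size() != dst.size()) throw std::invalid_argument("size mismatch");
        dst = data;
    }

    // Device-to-host copy through the stable handle.
    Bytes read(int app_handle) { return buffer(app_handle); }

    // Checkpoint: copy every device buffer to host memory, then tear the
    // device down so that no device state remains in the process.
    Snapshot checkpoint() {
        Snapshot snap;
        for (const auto& kv : raw_handle_) snap[kv.first] = device_->memory.at(kv.second);
        device_.reset();
        return snap;
    }

    // Restart: re-create the device, re-allocate each buffer, copy the
    // saved contents back, and rebind the stable handles to the new raw ones.
    void restart(const Snapshot& snap) {
        device_ = std::make_unique<MockDevice>();
        raw_handle_.clear();
        for (const auto& kv : snap) {
            int raw = buffer_alloc(kv.second.size());
            device_->memory.at(raw) = kv.second;
            raw_handle_[kv.first] = raw;
            if (kv.first >= next_app_handle_) next_app_handle_ = kv.first + 1;
        }
    }

    bool device_active() const { return device_ != nullptr; }

private:
    int buffer_alloc(std::size_t size) {
        assert(device_ != nullptr);  // allocating between checkpoint() and restart() is a usage error
        return device_->alloc(size);
    }
    Bytes& buffer(int app_handle) {
        assert(device_ != nullptr);  // same invariant for copies
        return device_->memory.at(raw_handle_.at(app_handle));
    }

    std::unique_ptr<MockDevice> device_;
    std::map<int, int> raw_handle_;  // application handle -> raw device handle
    int next_app_handle_ = 1;
};
```

In this sketch, a caller would allocate a buffer with `alloc`, fill it with `write`, and call `checkpoint()` to obtain a `Snapshot`. Passing that snapshot to `restart()` on a fresh `HookedRuntime` plays the role of resuming on another machine: the original handle still reads back the same bytes, even though the raw device handle behind it is new.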

Tags: Computer science, CUDA, High-level Languages, nVidia, nVidia GeForce 8600 M GT, nVidia GeForce 8800 GTX, nVidia GeForce GTX 280

January 17, 2011 by hgpu
