high performance computing on graphics processing units: hgpu.org


CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

Twinkle Jain, Gene Cooperman

Khoury College of Computer Sciences, Northeastern University, Boston, USA

arXiv:2008.10596 [cs.DC], (24 Aug 2020)

@misc{jain2020crac,
  title={CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM},
  author={Twinkle Jain and Gene Cooperman},
  year={2020},
  eprint={2008.10596},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}


Source code package: CRAC: Checkpoint-Restart Architecture for CUDA Streams and UVM


The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there does not currently exist an efficient, scalable solution for CUDA applications on NVIDIA GPUs. CRAC (Checkpoint-Restart Architecture for CUDA) is a new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. CRAC combines: low runtime overhead (approximately 1% or less); fast checkpoint-restart; support for scalable CUDA streams (for efficient usage of all of the thousands of GPU cores); and support for the full features of Unified Virtual Memory (eliminating the programmer’s burden of migrating memory between device and host). CRAC achieves its flexible architecture by segregating application code (checkpointed) and its external GPU communication via non-reentrant CUDA libraries (not checkpointed) within a single process’s memory. This eliminates the high overhead of inter-process communication found in earlier approaches, and it imposes fewer limitations.
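The segregation described in the abstract (application memory is checkpointed, CUDA library memory is not, and both live in one process) can be pictured with a toy model. The sketch below is only an illustration under stated assumptions, not CRAC's implementation: the `Region` type, the `owner` labels, and the `checkpoint` and `restart` helpers are hypothetical names, and the idea that a freshly initialized library region is supplied at restart is an assumption made for the example.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical memory region inside a single process (illustration only)."""
    name: str
    owner: str  # "application" (checkpointed) or "cuda_library" (not checkpointed)
    data: dict = field(default_factory=dict)


def checkpoint(regions):
    """Save deep copies of the application-owned regions and skip library regions."""
    return [copy.deepcopy(r) for r in regions if r.owner == "application"]


def restart(image, fresh_library_regions):
    """Rebuild a process from the saved application regions plus
    newly initialized library regions (an assumption of this sketch)."""
    return [copy.deepcopy(r) for r in image] + list(fresh_library_regions)


# One process holding both kinds of regions, as in the single-process design.
process = [
    Region("app_heap", "application", {"iteration": 41}),
    Region("cuda_runtime_state", "cuda_library", {"handle": 0x1234}),
]

image = checkpoint(process)
restored = restart(
    image,
    [Region("cuda_runtime_state", "cuda_library", {"handle": 0x5678})],
)
```

In this model the checkpoint image contains only `app_heap`, and the restored process pairs that saved state with a new library region whose handle differs from the original one. Because both regions sit in the same process, the application reaches the library through ordinary function calls rather than inter-process messages, which is the overhead the abstract says earlier approaches paid.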

Tags: Computer science, CUDA, HPC, nVidia, nVidia Quadro K600, Package, Tesla V100

August 30, 2020 by hgpu



