PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

Xin-Hai Xu, Xue-Jun Yang, Jing-Ling Xue, Yu-Fei Lin, Yi-Song Lin

National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha 410073, China

Journal of Computer Science and Technology, Volume 27, Issue 2, 240-255, 2012

DOI:10.1007/s11390-012-1220-5

GPGPUs are increasingly used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world’s fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerance mechanisms that offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method that achieves efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce the PartialRC method, which recovers from errors detected in a code region by partially recomputing the region, describe a checkpoint-based fault-tolerance framework built on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC: on average, by 73.5% when errors occur earlier during execution and by 74.6% when errors occur later. PartialRC also reduces the error detection overheads incurred by FullRC during fault recovery, while incurring negligible performance overheads when no fault occurs.
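
The contrast between full and partial recomputing described above can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the paper's compiler-directed CUDA implementation: the block-structured "kernel" `run_block`, the `recover` helper, the block size, and the set of faulty blocks (assumed to be reported by some error detector) are all hypothetical names and values introduced for illustration. It counts how many blocks each recovery strategy re-executes after rolling back to the checkpointed input.

```python
from typing import List, Set

def run_block(block_id: int, data: List[int], block_size: int) -> List[int]:
    """Hypothetical per-block kernel body: squares each element of the block's slice."""
    start = block_id * block_size
    return [x * x for x in data[start:start + block_size]]

def recover(data: List[int], output: List[int], faulty_blocks: Set[int],
            block_size: int, partial: bool) -> int:
    """Recompute a code region from its checkpointed input.

    partial=False re-executes every block (full recomputing);
    partial=True re-executes only the blocks flagged as faulty.
    Returns the number of block executions spent on recovery.
    """
    num_blocks = len(data) // block_size
    targets = sorted(faulty_blocks) if partial else range(num_blocks)
    executed = 0
    for b in targets:
        start = b * block_size
        output[start:start + block_size] = run_block(b, data, block_size)
        executed += 1
    return executed

BLOCK_SIZE = 4
data = list(range(32))                 # checkpointed input of the region (8 blocks)
expected = [x * x for x in data]       # fault-free result, used here only for checking

def faulty_output() -> List[int]:
    """Simulate a run in which two elements were corrupted."""
    out = expected.copy()
    out[5] = -1                        # element 5 lies in block 1
    out[22] = -1                       # element 22 lies in block 5
    return out

faulty = {1, 5}                        # assumption: an error detector reported these blocks

out_full = faulty_output()
full_cost = recover(data, out_full, faulty, BLOCK_SIZE, partial=False)

out_part = faulty_output()
partial_cost = recover(data, out_part, faulty, BLOCK_SIZE, partial=True)

print(full_cost, partial_cost)
```

In this toy setup both strategies restore the correct output, but the partial strategy re-executes only the flagged blocks. The counts say nothing about the overheads measured in the paper; they only show why recomputing part of a region can be cheaper when the blocks are independent of one another.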

Tags: Computer science, CUDA, Fault tolerance, Heterogeneous systems, nVidia, nVidia GeForce GTX 295

March 6, 2012 by hgpu
