high performance computing on graphics processing units: hgpu.org

G-CP: Providing Fault Tolerance on the GPU through Software Checkpointing

Felix Loh, Matt Sinclair

The University of Wisconsin-Madison

ECE 753 Project Progress Report Spring 2010

GPUs have become increasingly popular in recent years, in large part because they offer a large amount of computational power at low cost. GPU designers have also made GPU pipelines more general-purpose and more programmable, which has made GPUs attractive to a wider audience. GPUs are now often used in high-performance computing and other general-purpose application domains where data integrity matters, so providing fault tolerance on GPUs is becoming increasingly important. However, pre-Fermi Nvidia GPUs do not provide fault tolerance. In this project, we present G-CP, a mechanism that provides fault tolerance support on GPUs through software checkpointing combined with time and space redundancy. With G-CP, GPU algorithms can periodically checkpoint their work; if a fault occurs, the user can roll back to the last checkpoint and continue executing from there.
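The checkpoint-and-rollback idea described in the abstract can be illustrated with a small host-side sketch. This is not the authors' implementation: the names `step` and `run_with_checkpoints`, the toy update rule, and the fault-injection parameter `transient_faults` are all hypothetical, and plain Python stands in for GPU kernels. Each step is executed twice and the results are compared (time redundancy); on a GPU the second copy could instead run on different hardware resources (space redundancy), which the sketch does not model.

```python
def step(state, i):
    """One unit of work: a toy integer update standing in for a kernel launch."""
    return [x * 2 + i for x in state]


def run_with_checkpoints(state, n_steps, interval, transient_faults=()):
    """Run n_steps of step(), checkpointing after every `interval` verified steps.

    transient_faults (hypothetical, for illustration only) lists step indices
    whose first execution is corrupted by a single injected bit flip.
    Returns the final state and the number of rollbacks performed.
    """
    pending = set(transient_faults)
    checkpoint = (0, list(state))  # (next step index, saved copy of the state)
    rollbacks = 0
    i = 0
    while i < n_steps:
        primary = step(state, i)
        replica = step(state, i)   # redundant execution of the same step
        if i in pending:           # inject a one-off fault into the primary copy
            pending.discard(i)
            primary[0] ^= 1
        if primary != replica:     # mismatch: treat as a detected fault
            i, state = checkpoint[0], list(checkpoint[1])  # roll back
            rollbacks += 1
            continue
        state = primary            # both copies agree, accept the result
        i += 1
        if i % interval == 0:      # periodic checkpoint of verified work
            checkpoint = (i, list(state))
    return state, rollbacks


clean_state, clean_rollbacks = run_with_checkpoints([1, 2, 3], 10, 3)
faulty_state, faulty_rollbacks = run_with_checkpoints(
    [1, 2, 3], 10, 3, transient_faults={4, 7}
)
```

In this sketch, a shorter checkpoint interval means less work is repeated after a rollback but more copies of the state are saved, which is the usual trade-off in checkpointing schemes.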

Tags: Algorithms, Computer science, CUDA, Fault tolerance, nVidia, Tesla C1060

December 25, 2011 by hgpu
