Reverse Computation for Rollback-based Fault Tolerance in Large Parallel Systems: Evaluating the Potential Gains and Systems Effects

Kalyan S. Perumalla, Alfred J. Park
Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831
Cluster Computing, 2013

@article{perumalla2013reverse,
   title={Reverse Computation for Rollback-based Fault Tolerance in Large Parallel Systems},
   author={Kalyan S. Perumalla and Alfred J. Park},
   year={2013},
   journal={Cluster Computing: Journal of Networks, Software Tools and Applications},
   note={DOI: 10.1007/s10586-013-0277-4, Special Issue on Unconventional Cluster Architectures and Applications}
}

Reverse computation is presented here as an important future direction in addressing the challenge of fault tolerant execution on very large cluster platforms for parallel computing. As the scale of parallel jobs increases, traditional checkpointing approaches suffer scalability problems ranging from computational slowdowns to high congestion at the persistent stores for checkpoints. Reverse computation can overcome such problems and is also better suited for parallel computing on newer architectures with smaller, cheaper or energy-efficient memories and file systems. Initial evidence for the feasibility of reverse computation in large systems is presented with detailed performance data from a particle simulation scaling to 65,536 processor cores and 950 accelerators (GPUs). Reverse computation is observed to deliver very large gains relative to checkpointing schemes when nodes rely on their host processors/memory to tolerate faults at their accelerators. A comparison between reverse computation and checkpointing with measurements such as cache miss ratios, TLB misses and memory usage indicates that reverse computation is hard to ignore as a future alternative to be pursued in emerging architectures.
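To make the contrast concrete, the following is a minimal sketch (in C, not the authors' code) of rollback by reverse computation for a trivially reversible particle drift update; the state layout, step count, and fault point are illustrative assumptions only.

/* Minimal sketch contrasting checkpoint-based rollback with reverse
 * computation: the earlier state is re-derived by inverting each step
 * rather than reloaded from a saved checkpoint. Illustrative only. */
#include <stdio.h>

#define N 4

typedef struct { double x[N]; double v[N]; } State;

/* Forward step: a perfectly reversible update (constant-velocity drift). */
void step_forward(State *s, double dt) {
    for (int i = 0; i < N; i++)
        s->x[i] += s->v[i] * dt;
}

/* Reverse step: undoes step_forward exactly, so no copy of x needs to be
 * written to memory or a parallel file system to support rollback. */
void step_reverse(State *s, double dt) {
    for (int i = 0; i < N; i++)
        s->x[i] -= s->v[i] * dt;
}

int main(void) {
    State s = { {0.0, 1.0, 2.0, 3.0}, {1.0, 1.0, 1.0, 1.0} };
    double dt = 0.5;

    /* Advance several steps; suppose a fault is detected afterwards. */
    for (int step = 0; step < 3; step++)
        step_forward(&s, dt);

    /* Rollback by reverse computation instead of checkpoint restore. */
    for (int step = 0; step < 3; step++)
        step_reverse(&s, dt);

    printf("x[0] after rollback = %f (expected 0.0)\n", s.x[0]);
    return 0;
}

In general, only perfectly reversible operations can be undone this cheaply; lossy or destructive updates require saving the small amount of information they destroy, which is still far less state than a full checkpoint.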