high performance computing on graphics processing units: hgpu.org

A high-performance fault-tolerant software framework for memory on commodity GPUs

Naoya Maruyama, Akira Nukada, Satoshi Matsuoka

GSIC, Tokyo Institute of Technology, JST CREST

IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010

DOI:10.1109/IPDPS.2010.5470473

@conference{maruyama2010high,
  title={A high-performance fault-tolerant software framework for memory on commodity GPUs},
  author={Maruyama, N. and Nukada, A. and Matsuoka, S.},
  booktitle={Parallel \& Distributed Processing (IPDPS), 2010 IEEE International Symposium on},
  pages={1--12},
  issn={1530-2075},
  organization={IEEE}
}

As GPUs become more flexible and programmable and are increasingly used to accelerate HPC applications, their fault tolerance is becoming much more important than it was when they were used only for graphics. The current generation of GPUs, however, does not have standard error detection and correction capabilities, such as SEC-DED ECC for DRAM, which is almost always used in HPC servers. We present a high-performance software framework to enhance commodity off-the-shelf GPUs with DRAM fault tolerance. It combines data coding for detecting bit-flip errors and checkpointing for recovering computations when such errors are detected. We analyze the performance of data coding on GPUs and present optimizations geared toward memory-intensive GPU applications. We present performance studies of the prototype implementation of the framework and show that it can be realized with negligible overheads in compute-intensive applications such as the N-body problem and matrix multiplication, and with overheads as low as 35% in a highly efficient, memory-intensive 3-D FFT kernel.
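The combination of data coding and checkpointing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which code the framework uses, so the example assumes a simple XOR parity word per buffer, and the names `ProtectedBuffer`, `parity_word`, `save_checkpoint`, `verify`, and `recover` are hypothetical. A plain Python list stands in for GPU DRAM, and the checkpoint is assumed to live in ECC-protected host memory.

```python
from functools import reduce
from operator import xor


def parity_word(block):
    """XOR all words of a block into a single check word (assumed code)."""
    return reduce(xor, block, 0)


class ProtectedBuffer:
    """Hypothetical stand-in for a GPU DRAM buffer guarded by a check word."""

    def __init__(self, words):
        self.data = list(words)
        self.code = parity_word(self.data)
        self.checkpoint = None

    def save_checkpoint(self):
        # Copy data and check word to host memory, assumed ECC-protected.
        self.checkpoint = (list(self.data), self.code)

    def write(self, index, value):
        # Incremental update: XOR out the old word, XOR in the new one.
        self.code ^= self.data[index] ^ value
        self.data[index] = value

    def verify(self):
        # Recompute the check word and compare it with the stored one.
        return parity_word(self.data) == self.code

    def recover(self):
        # Roll back to the last checkpoint after a detected error.
        if self.checkpoint is None:
            raise RuntimeError("no checkpoint available")
        data, code = self.checkpoint
        self.data, self.code = list(data), code


buf = ProtectedBuffer([0x1, 0x2, 0x3, 0x4])
buf.save_checkpoint()
buf.data[2] ^= 1 << 7        # simulate a single bit flip in DRAM
detected = not buf.verify()  # the parity mismatch exposes the flip
if detected:
    buf.recover()            # restore the checkpointed state
```

A single XOR word detects any single bit flip but misses an even number of flips in the same bit position, and it cannot correct errors on its own, which is why the sketch pairs detection with rollback to a checkpoint.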

Tags: Computer science, CUDA, Fault simulation, FFT, Matrix multiplication, N-body simulation, nVidia, nVidia GeForce 8800 GTS, nVidia GeForce GTX 285, Tesla S1070

April 14, 2011 by hgpu
