high performance computing on graphics processing units: hgpu.org


GPUburn: A System to Test and Mitigate GPU Hardware Failures

David Defour, Eric Petit

Laboratoire d’Informatique de Robotique et de Microelectronique de Montpellier (LIRMM)

hal-00827588, 2013

@techreport{defour:hal-00827588,
  hal_id={hal-00827588},
  url={http://hal.archives-ouvertes.fr/hal-00827588},
  title={GPUburn: A System to Test and Mitigate GPU Hardware Failures},
  author={Defour, David and Petit, Eric},
  keywords={GPU; GPGPU; CUDA; OpenCL; fault-tolerance; soft-error; microbenchmark},
  language={Anglais},
  affiliation={Laboratoire d’Informatique de Robotique et de Micro{\'e}lectronique de Montpellier – LIRMM , Laboratoire de Recherche Commun "Innovation in Teracomputing and Computing Algorithms" – LRC ITACA},
  note={Paper accepted in SAMOS 2013},
  year={2013},
  month={May},
  pdf={http://hal.archives-ouvertes.fr/hal-00827588/PDF/GPUburn_SAMOS.pdf}
}


Due to factors such as high transistor density, high frequency, and low voltage, today’s processors are more than ever subject to hardware failures. These errors have various impacts depending on the location of the error and the type of processor. Because of the hierarchical structure of the compute units and of work scheduling, a hardware failure on a GPU affects only part of the application. In this paper we present a new methodology to characterize the hardware failures of Nvidia GPUs, based on a software micro-benchmarking platform implemented in OpenCL. We also show which hardware parts of the Tesla architecture are more sensitive to intermittent errors, which usually appear as the processor ages. We obtained these results by accelerating the aging process, running the processors at high temperature. We show that on GPUs the impact of intermittent errors is limited to a localized architecture tile. Finally, we propose a methodology to detect defective units and record their location so that they can be avoided, which ensures program correctness on such architectures and improves the fault tolerance and lifespan of the GPU.
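The detect, record, and avoid idea in the abstract can be illustrated with a small host-side sketch. This is not the GPUburn implementation: the abstract does not describe how results are collected or how work is steered away from faulty units, so everything here is an assumption. The test results are simulated rather than produced by an OpenCL kernel, and `find_defective_units` and `assign_work` are hypothetical helper names. The sketch also assumes a scheduler that lets the host pick the unit for each work item, which real GPU hardware schedulers generally do not expose directly.

```python
from collections import defaultdict


def find_defective_units(results, expected):
    """Count wrong answers per unit.

    `results` is an iterable of (unit_id, value) pairs, standing in for
    what a redundant test kernel might report (hypothetical format).
    """
    error_counts = defaultdict(int)
    for unit_id, value in results:
        if value != expected:
            error_counts[unit_id] += 1
    return dict(error_counts)


def assign_work(num_items, num_units, defective):
    """Distribute work items round-robin over units not marked defective."""
    healthy = [u for u in range(num_units) if u not in defective]
    if not healthy:
        raise RuntimeError("no healthy compute unit left")
    schedule = defaultdict(list)
    for item in range(num_items):
        schedule[healthy[item % len(healthy)]].append(item)
    return dict(schedule)


# Simulated micro-benchmark output (assumption): unit 2 is intermittently
# wrong, returning the correct value once and a wrong value twice.
EXPECTED = 42
simulated = [(0, 42), (1, 42), (2, 41), (3, 42), (2, 42), (2, 40), (1, 42)]

errors = find_defective_units(simulated, EXPECTED)
schedule = assign_work(num_items=6, num_units=4, defective=set(errors))
```

Counting errors per unit, rather than flagging on the first mismatch, reflects the intermittent nature of the failures described above: a unit may pass some runs and fail others, so the record accumulates over repeated tests.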

Tags: Benchmarking, Computer science, CUDA, Fault tolerance, nVidia, OpenCL, Tesla D870

June 2, 2013 by hgpu
