high performance computing on graphics processing units: hgpu.org

Improving GPU Robustness by Making Use of Faulty Parts

Artem Durytskyy, Mohamed Zahran, Ramesh Karri

ECE Department, Polytechnic Institute of NYU, New York, NY

IEEE 29th International Conference on Computer Design (ICCD), 2011

DOI:10.1109/ICCD.2011.6081422

With hundreds of processing units in current state-of-the-art graphics processing units (GPUs), the probability that one or more processing units fail due to permanent faults, either during fabrication or after deployment, increases drastically. In our experiments we found that the loss of a single streaming multiprocessor (SM) in an 8-SM GPU resulted in as much as a 16% performance loss. The default method for dealing with faulty SMs is to turn them off. Although faulty SMs cannot be trusted to execute a single kernel (a program assigned to an SM) correctly to completion, we show that these SMs can still be used to improve system throughput by generating and supplying high-level hints to the functional SMs. By making the faulty SMs supply hints to functional SMs, we achieved an average speed-up of about 16% over the baseline case, in which the faulty SMs are turned off. The proposed technique requires minimal hardware overhead and is highly scalable.
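The scaling argument in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not taken from the paper: it assumes that faults in different units are independent and equally likely, and the function names and the per-unit fault probability are hypothetical, chosen only for illustration.

```python
def prob_at_least_one_fault(num_units: int, p_unit: float) -> float:
    """Probability that one or more of `num_units` units is faulty.

    Assumption (not from the paper): faults are independent and each
    unit is faulty with the same probability `p_unit`.
    """
    if num_units < 0 or not 0.0 <= p_unit <= 1.0:
        raise ValueError("need num_units >= 0 and 0 <= p_unit <= 1")
    return 1.0 - (1.0 - p_unit) ** num_units


def nominal_capacity_loss(total_sms: int, faulty_sms: int) -> float:
    """Fraction of SMs lost when faulty SMs are simply turned off."""
    if total_sms <= 0 or not 0 <= faulty_sms <= total_sms:
        raise ValueError("need total_sms > 0 and 0 <= faulty_sms <= total_sms")
    return faulty_sms / total_sms


if __name__ == "__main__":
    p = 0.001  # hypothetical per-unit fault probability
    for n in (8, 64, 512):
        print(f"{n:4d} units: P(>=1 fault) = {prob_at_least_one_fault(n, p):.4f}")
    # Turning off 1 of 8 SMs removes 12.5% of the SMs.
    print(f"nominal loss for 1 of 8 SMs: {nominal_capacity_loss(8, 1):.3f}")
```

Under these assumptions the chance of at least one fault grows quickly with the unit count. Turning off one of eight SMs removes 12.5% of the SMs, somewhat below the worst-case loss of 16% that the abstract reports.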

Tags: Computer science, Fault simulation, Hardware Architecture, nVidia, Performance

December 16, 2011 by hgpu
