high performance computing on graphics processing units: hgpu.org

GPU-Qin: A Methodology for Evaluating the Error Resilience of GPGPU Applications

Bo Fang, Karthik Pattabiraman, Matei Ripeanu, Sudhanva Gurumurthi

Department of Electrical and Computer Engineering, University of British Columbia

IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’14), 2014

While graphics processing units (GPUs) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging: their massive parallelism makes it difficult to achieve representativeness while remaining time-efficient. This paper makes three key contributions. First, it presents the design of a fault-injection methodology to evaluate end-to-end reliability properties of application kernels running on GPUs. Second, it introduces a fault-injection tool that uses real GPU hardware and offers a good balance between the representativeness and the efficiency of the fault-injection experiments. Third, it characterizes the error resilience of twelve GPGPU applications.
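To make the idea of a fault-injection campaign concrete, here is a minimal Python sketch. It is not GPU-Qin's implementation and does not run on GPU hardware. It assumes a single-bit-flip fault model, a common choice in fault-injection studies. The "kernel" is a hypothetical toy in which each simulated thread computes a 32-bit register value and then applies a threshold test. Each run flips one randomly chosen bit in one thread's register and compares the output with a fault-free "golden" run. Runs whose output is unchanged count as benign, because the fault was masked. Runs whose output differs count as silent data corruption (SDC). The names `flip_bit`, `toy_kernel`, and `run_campaign` are illustrative only.

```python
import random


def flip_bit(word: int, bit: int) -> int:
    """Flip one bit of a 32-bit word (single-bit-flip fault model)."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 32)")
    return (word ^ (1 << bit)) & 0xFFFFFFFF


def toy_kernel(xs, threshold, fault=None):
    """Hypothetical kernel: simulated thread i computes 3*x + 1 in a 32-bit
    register, then writes 1 if the register exceeds threshold, else 0.
    fault = (thread_index, bit) corrupts that thread's register."""
    out = []
    for tid, x in enumerate(xs):
        reg = (3 * x + 1) & 0xFFFFFFFF
        if fault is not None and fault[0] == tid:
            reg = flip_bit(reg, fault[1])
        out.append(1 if reg > threshold else 0)
    return out


def run_campaign(xs, threshold, trials, seed=0):
    """Inject one fault per run at a random (thread, bit) site and tally
    outcomes against the fault-free golden output. This toy cannot crash or
    hang, so only benign (masked) and SDC outcomes are counted."""
    rng = random.Random(seed)
    golden = toy_kernel(xs, threshold)
    tally = {"benign": 0, "sdc": 0}
    for _ in range(trials):
        fault = (rng.randrange(len(xs)), rng.randrange(32))
        observed = toy_kernel(xs, threshold, fault)
        tally["benign" if observed == golden else "sdc"] += 1
    return tally


if __name__ == "__main__":
    print(run_campaign(list(range(100)), threshold=150, trials=1000))
```

The threshold test is there on purpose. It shows how application logic can mask many bit flips, which is why resilience differs from one application to another. A real GPU injector would also have to detect crashes and hangs, and would have to choose injection sites representatively across a very large number of threads. The abstract identifies that choice as the core difficulty.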

Tags: Computer science, CUDA, GPGPU-sim, nVidia

January 28, 2014 by hgpu
