GPU-Qin: A Methodology for Evaluating the Error Resilience of GPGPU Applications
Department of Electrical and Computer Engineering, University of British Columbia
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’14), 2014
@inproceedings{fang2014gpu,
title={GPU-Qin: A Methodology for Evaluating the Error Resilience of GPGPU Applications},
author={Fang, Bo and Pattabiraman, Karthik and Ripeanu, Matei and Gurumurthi, Sudhanva},
booktitle={IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
year={2014}
}
While graphics processing units (GPUs) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging due to their massive parallelism, which makes it difficult to achieve representativeness while remaining time-efficient. This paper makes three key contributions. First, it presents the design of a fault-injection methodology to evaluate end-to-end reliability properties of application kernels running on GPUs. Second, it introduces a fault-injection tool that uses real GPU hardware and offers a good balance between the representativeness and the efficiency of the fault-injection experiments. Third, it characterizes the error resilience of twelve GPGPU applications.
January 28, 2014 by hgpu