Fault Injection techniques for GPU Reliability Evaluation

Luigi Galasso

Politecnico di Torino

Politecnico di Torino, 2022

A Graphics Processing Unit (GPU) is a computer chip that renders graphics and images by performing rapid mathematical calculations. In recent years, GPUs have also been used beyond graphics processing as General-Purpose GPUs (GPGPUs): they work as hardware accelerators for high-performance computing in many different fields, including safety-critical applications. In these domains, Convolutional Neural Networks (CNNs) are a widely used computing approach that is well supported by GPUs, since CNNs leverage data- and thread-level parallelism. For this reason, the reliability of GPUs must be evaluated to ensure that they meet the desired requirements. To achieve this objective, it is necessary to study GPU behavior in the presence of hardware faults. This thesis project analyzes, in particular, permanent faults affecting GPU functionality. A permanent fault persists indefinitely after its occurrence: it manifests as stuck-at bits in the architecture, that is, lines that always carry the logical value "0" or "1". Such faults can be mimicked in software by injecting errors into the code running on the GPU; this can be achieved by masking, at the assembly level, one or more bits of a selected register before or after the corresponding instruction is executed. Therefore, this work developed a framework, based on a binary instrumentation tool (NVBitFI), designed to perform permanent fault injection campaigns. Several injection techniques were devised to target distinct elements inside a GPU Streaming Multiprocessor: the Register Files and the Functional Units (Floating Point, Integer, and Special Function Units). The presented environment was used to test an NVIDIA GPU with a specific CNN target application, namely the LeNet model available in the Darknet environment. To support the framework, many fault simulations were performed, and the results obtained were analyzed and compared.
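The register-masking idea described in the abstract can be illustrated with a small host-side sketch. This is not the thesis framework and does not use the NVBitFI API: the function names (`stuck_at`, `faulty_fadd`) and the 32-bit register model are assumptions made only for illustration. A stuck-at-0 fault is emulated by AND-ing the register value with the inverted bit mask, and a stuck-at-1 fault by OR-ing it with the mask.

```python
import struct

WORD_MASK = 0xFFFFFFFF  # assumption: registers are modeled as 32-bit words


def stuck_at(value: int, bit_mask: int, stuck_value: int) -> int:
    """Force the bits selected by bit_mask to stuck_value (0 or 1)."""
    if stuck_value not in (0, 1):
        raise ValueError("stuck_value must be 0 or 1")
    value &= WORD_MASK
    bit_mask &= WORD_MASK
    if stuck_value == 1:
        return value | bit_mask               # stuck-at-1: OR sets the selected bits
    return value & ~bit_mask & WORD_MASK      # stuck-at-0: AND clears the selected bits


def float_to_bits(x: float) -> int:
    """Reinterpret a value, rounded to single precision, as a 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & WORD_MASK))[0]


def faulty_fadd(a: float, b: float, bit_mask: int, stuck_value: int) -> float:
    """Hypothetical model of a float add whose destination register has stuck bits.

    The fault is applied after the instruction "executes", i.e. on the result.
    """
    result_bits = float_to_bits(a + b)
    return bits_to_float(stuck_at(result_bits, bit_mask, stuck_value))


if __name__ == "__main__":
    print(stuck_at(0b1010, 0b0010, 0))          # bit 1 stuck at 0 -> 8
    print(faulty_fadd(1.0, 1.0, 1 << 31, 1))    # sign bit stuck at 1 -> -2.0
```

Because the fault is permanent, a model like this would apply the same mask on every execution of the targeted instruction or register, in contrast to a transient fault, which would corrupt a single execution only.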

Tags: Computer science, CUDA, Fault simulation, nVidia, Thesis

May 29, 2022 by hgpu
