Fault Injection techniques for GPU Reliability Evaluation

Luigi Galasso
Politecnico di Torino
Politecnico di Torino, 2022

@article{reorda2022fault,
   title={Fault Injection techniques for GPU Reliability Evaluation},
   author={Reorda, Matteo Sonza and Galasso, Luigi},
   year={2022}
}

A Graphics Processing Unit (GPU) is a computer chip that renders graphics and images by performing rapid mathematical calculations. In recent years, GPUs have also been used beyond graphics processing as General-Purpose GPUs (GPGPUs): they serve as hardware accelerators for high-performance computing in many fields, including safety-critical applications. In these domains, Convolutional Neural Networks (CNNs) are a widely used computing approach that is well supported by GPUs, since CNNs leverage data- and thread-level parallelism. For this reason, the reliability of GPUs must be evaluated to verify that they meet the desired requirements. To achieve this objective, it is necessary to study GPU behavior in the presence of hardware faults. This thesis project analyzes, in particular, permanent faults affecting GPU functionality. A permanent fault persists indefinitely after its occurrence: it manifests as stuck-at bits in the architecture, that is, lines that always carry the logical value "0" or "1". Such faults can be mimicked in software by injecting errors into the code running on the GPU; this can be done by masking, at the assembly level, one or more bits of a selected register before or after the corresponding instruction is executed. To this end, this work developed a framework, based on a binary instrumentation tool (NVBitFI), designed to perform permanent fault injection campaigns. Several injection techniques were devised to target distinct elements inside a GPU Streaming Multiprocessor: the Register Files and the Functional Units (Floating Point, Integer, and Special Function Units). The framework was used to test an NVIDIA GPU with a specific CNN target application, namely the LeNet model available in the Darknet environment. To support the framework, many fault simulations were performed, and the results were analyzed and compared.
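
The register-masking idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis framework and does not use the NVBitFI API. The function names (`apply_stuck_at`, `faulty_fadd`) and the 32-bit register width are assumptions made for illustration. The sketch forces selected bits of a register value to 0 or 1, and emulates a fault on the destination register of a floating-point add by masking the result after the operation.

```python
import struct

MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def apply_stuck_at(value, bit_mask, stuck_value):
    """Force the bits selected by bit_mask to stuck_value (0 or 1)."""
    value &= MASK32
    bit_mask &= MASK32
    if stuck_value == 0:
        return value & ~bit_mask & MASK32  # stuck-at-0: clear the bits
    if stuck_value == 1:
        return value | bit_mask            # stuck-at-1: set the bits
    raise ValueError("stuck_value must be 0 or 1")


def float_to_bits(x):
    """Reinterpret a float as its 32-bit IEEE 754 single-precision pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & MASK32))[0]


def faulty_fadd(a, b, bit_mask, stuck_value):
    """Hypothetical helper: add two floats, then corrupt the result
    register as if the masked bits were permanently stuck."""
    result_bits = float_to_bits(a + b)
    return bits_to_float(apply_stuck_at(result_bits, bit_mask, stuck_value))


if __name__ == "__main__":
    # Sign bit stuck at 1 turns the result 2.0 into -2.0.
    print(faulty_fadd(1.0, 1.0, 1 << 31, 1))
    # Sign bit stuck at 0 leaves a positive result unchanged (fault masked).
    print(faulty_fadd(1.0, 1.0, 1 << 31, 0))
```

Because a permanent fault persists, a campaign built on this idea would apply the same mask at every execution of the targeted instruction or register, not just once. The second call also shows that a stuck-at fault has no visible effect when the affected bit already holds the stuck value.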