
Increased reliability on Intel GPUs via software diverse redundancy

Nikolaos Andriotis
Facultat d’Informàtica de Barcelona (FIB), Universitat Politècnica de Catalunya (UPC), BarcelonaTech

@mastersthesis{andriotis2023increased,
   title={Increased reliability on Intel GPUs via software diverse redundancy},
   author={Andriotis, Nikolaos},
   year={2023},
   school={Universitat Polit{\`e}cnica de Catalunya}
}


During the past decade, the industry has revolutionized its processes by adopting Artificial Intelligence. Nowadays, this revolution extends from the manufacturing industry to more critical sectors, such as avionics, automotive, or healthcare, where errors are unacceptable. One clear example is the automotive industry, where the installation of Advanced Driver Assistance Systems (ADAS) is now a reality and the aim is to achieve fully self-driving cars (SDCs) in the near future. This emerging domain has increased researchers' interest in ADAS and Autonomous Driving (AD) systems, as these domains require processing high volumes of data with complex algorithms (Deep Learning (DL)) at high frequency to meet tight Real Time (RT) constraints. In this context, traditional computing quickly became a bottleneck, as CPUs are unable to handle such volumes of data and process them on time. In contrast, high-performance graphics processing units (GPUs) have recently provided the required computing performance and partially fulfilled the timing constraints. Thus, electronics manufacturers continuously innovate to improve their devices' performance, introducing state-of-the-art GPUs equipped with new accelerators and enhancing their GPUs in terms of performance and efficiency (i.e., performance per Watt). For instance, in 2017 NVIDIA introduced the Jetson AGX Xavier SoC, a GPU-based low-power device designed mainly for accelerating machine learning applications and focused on the automotive sector. However, AD and ADAS challenges are not only related to performance or timing constraints; another constraint to satisfy is safety. Critical systems, such as AD or ADAS, have to provide the correct outcome of their computation, as people's lives depend on them. In this sense, the AD sector has an additional constraint: functional safety.
Functional safety problems have long been studied, and the only way to address them is through redundancy, which allows identifying or correcting an erroneous outcome. Additionally, to ensure the highest safety levels, these systems introduce diversity to avoid the redundant computations being compromised at the same point and the errors going undetected (common-cause faults, CCF). To ensure that the high-performance hardware used for AD works as expected and that specific safety goals are met, specific hardware support is included to realize safety measures, and exhaustive verification and validation (V&V) processes are carried out. These verification processes are extremely costly, especially when custom hardware is used, and the design and fabrication of such hardware is also an onerous task. As a result, the automotive industry tries to avoid these non-recurring costs by targeting widespread and cheap hardware, i.e., commercial off-the-shelf (COTS) products. However, COTS devices present a drawback: manufacturers are reluctant to provide redundant hardware to end users due to its high cost, power consumption, and low performance ratio. In addition, they jealously guard, in most cases, the implementation details, which limits adoption by industries that require reliable computation. Therefore, the hardware limits redundancy by design and thus extends the functional safety requirements beyond the boundaries of the hardware layers to the entire software stacks on such devices. In this sense, researchers have to deal with the limitations of COTS solutions and build more affordable and promising software-based solutions, especially to realize diverse redundancy so that, even if a single fault affects all replicas, the replicas, being diverse, produce different errors that can be detected by comparison.
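The role of diversity can be illustrated with a minimal, hardware-agnostic sketch (plain Python, not the thesis implementation; the fault model and function names are hypothetical): two replicas accumulate the same inputs in different orders, so a common-cause fault striking both replicas at the same execution step corrupts different operands and yields different errors, which a simple comparison then exposes.

```python
# Minimal sketch of diverse redundancy with comparison-based detection.
# The fault model is hypothetical: at a given execution step, the operand
# being processed suffers a single-bit flip in *both* replicas (a CCF).

def replica_a(values, fault_step=None):
    # Primary replica: accumulates left to right.
    total = 0
    for step, v in enumerate(values):
        if step == fault_step:
            v ^= 1  # common-cause fault hits the operand at this step
        total += v
    return total

def replica_b(values, fault_step=None):
    # Diverse replica: accumulates right to left, so the same fault step
    # corrupts a *different* operand than in replica_a.
    total = 0
    for step, v in enumerate(reversed(values)):
        if step == fault_step:
            v ^= 1
        total += v
    return total

def error_detected(values, fault_step=None):
    # Identical (non-diverse) replicas would corrupt the same operand and
    # agree on the wrong result; diversity makes the errors differ.
    return replica_a(values, fault_step) != replica_b(values, fault_step)
```

With `values = [3, 5, 8, 14]`, both replicas return 30 in the fault-free case, while a fault at step 0 flips `3` in one replica and `14` in the other, producing 29 versus 31 and hence a detectable mismatch.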
Thus, software-only diverse redundancy solutions have to be deployed on top of COTS solutions and deal with two main limitations: 1) computation needs to occur redundantly to enable error detection, and 2) redundancy must be guaranteed to occur with diversity so that, even if a single fault affects all replicas (e.g., by affecting the clock or power networks), the resulting errors differ and can be detected, hence avoiding the so-called Common Cause Failures (CCFs). For instance, COTS GPUs lack explicit hardware support for diverse redundancy; thus, software-based solutions are being developed, but most current implementations provide limited guarantees and have focused only on NVIDIA GPUs. In contrast, this thesis presents a software-only solution that enables diverse redundancy on Intel GPUs, achieving strong guarantees on the diversity provided for the first time. One key characteristic of this solution is that it is built on top of OpenCL, a hardware-agnostic programming language that can be extended with special compiler-handled functions, the so-called intrinsics. These functions are implementation-dependent and highly optimized, meaning the platform integrator must provide them. For instance, the intrinsics used in this thesis allow identifying the hardware thread of the GPU on which any given software thread executes, which enables smart tailoring of the workload geometry and its allocation to specific computing elements inside the GPU. As a result, redundant threads are guaranteed to use physically diverse execution units, hence meeting diverse redundancy requirements with affordable performance overheads. The technique issues as many software threads as there are available hardware (HW) threads in the GPU, then allocates half of them to execute one kernel and the other half to execute the redundant one.
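The allocation scheme described above can be sketched as a host-side simulation (hypothetical parameters and names; on a real Intel GPU the hardware-thread id would come from the compiler intrinsics mentioned in the text, not from a fixed constant):

```python
# Host-side sketch of HW-thread-aware work allocation: the GPU is assumed
# to expose HW_THREADS hardware threads, with the first half and the second
# half residing in physically separate parts of the device.

HW_THREADS = 8  # hypothetical device size; real values come from the GPU

def schedule(num_items):
    """Map each work item to a (primary, redundant) pair of hardware
    threads guaranteed to live in different halves of the GPU."""
    half = HW_THREADS // 2
    plan = []
    for item in range(num_items):
        primary = item % half           # primary kernel: hw threads 0..half-1
        redundant = half + item % half  # redundant kernel: half..HW_THREADS-1
        plan.append((item, primary, redundant))
    return plan
```

Because the two replicas of every work item are pinned to opposite halves of the device, they never share an execution unit, which is the diversity guarantee the thesis establishes through intrinsics rather than by trusting the regular scheduler.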
To reach the final diverse and redundant solution, several scenarios are developed to efficiently measure the impact of each step of the modifications to a normal OpenCL kernel execution. First, only half of the available GPU resources are allocated, allowing one kernel to run wherever the scheduler decides. Then the scheduler is overridden and forced to use a specific half of the resources, so that only one independent part of the GPU is used (in this way, the overhead of HW-thread-aware work allocation is evaluated). Subsequently, the work is duplicated (to mimic the two-kernel execution), and lastly, both kernels are forced to execute in independent parts of the GPU.


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
