Adding fault tolerance to OpenCL: Through redundant heterogeneous computing
TU Delft Electrical Engineering, Mathematics and Computer Science
Delft University of Technology, 2023
@article{bijl2023adding,
title={Adding fault tolerance to OpenCL: Through redundant heterogeneous computing},
author={Bijl, Robin},
year={2023}
}
The ever-increasing demand for computing has led to the need for specialized heterogeneous hardware and for the frameworks required to use it. In addition to traditional central processing units, more and more programs use specialized hardware to accelerate computations. However, the increase in computing also leads to a shorter mean time between failures. In this thesis, we add fault tolerance to the Portable Computing Language (PoCL), an open-source implementation of the OpenCL standard. We show that our solution is easy to apply to existing programs that use PoCL/OpenCL and that it greatly reduces the total number of errors visible to the end user. Our solution can be used on any device supported by PoCL and incurs low overhead, provided that the hardware requirements are met.
December 31, 2023 by hgpu