Heterogeneity-aware Fault Tolerance using a Self-Organizing Runtime System

Mario Kicherer, Wolfgang Karl
Chair for Computer Architecture and Parallel Processing, Karlsruhe Institute of Technology
arXiv:1405.2912 [cs.OS], (12 May 2014)






Because they contain diverse and implicitly redundant processing units and compute kernels, off-the-shelf heterogeneous systems offer the opportunity to detect and tolerate faults during task execution, in hardware as well as in software. To leverage this diversity automatically, we introduce an extension of an online-learning runtime system that combines the benefits of the existing performance-oriented task mapping with task duplication, a diversity-oriented mapping strategy, and a heterogeneity-aware majority voter. This extension uses a new metric to dynamically rate the remaining benefit of unreliable processing units, and a memory management mechanism for automatic data transfers and checkpointing in host and device memories.
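
The combination of task duplication and majority voting described in the abstract can be illustrated with a minimal sketch. It is not the authors' runtime system: the function names (`majority_vote`, `run_replicated`) and the three toy kernels are hypothetical, and the "processing units" are simulated as plain Python callables that compute the same result in different ways.

```python
from collections import Counter


def majority_vote(results):
    """Return (winner, agreeing_indices), or (None, []) without a strict majority."""
    counts = Counter(results)
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(results):
        # No value was produced by more than half of the replicas.
        return None, []
    return winner, [i for i, r in enumerate(results) if r == winner]


def run_replicated(task_input, implementations):
    """Run every (hypothetical) implementation on the same input and vote.

    Returns the voted result and the indices of implementations that
    disagreed with it, i.e. the candidates for being faulty.
    """
    results = [impl(task_input) for impl in implementations]
    winner, agreeing = majority_vote(results)
    suspects = [i for i in range(len(results)) if i not in agreeing]
    return winner, suspects


# Three diverse implementations of the same task (sum of squares).
def kernel_a(xs):
    return sum(x * x for x in xs)


def kernel_b(xs):
    total = 0
    for x in xs:
        total += x ** 2
    return total


def faulty_kernel(xs):
    # Simulates a unit that silently corrupts its result.
    return sum(x * x for x in xs) + 1


if __name__ == "__main__":
    result, suspects = run_replicated([1, 2, 3], [kernel_a, kernel_b, faulty_kernel])
    print(result, suspects)
```

In this sketch the voter only compares hashable results for exact equality. A voter for real heterogeneous devices would likely need a tolerance for floating-point differences between units, which is omitted here.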

