Implementation Techniques for SPMD Kernels on CPUs

Joachim Meyer, Aksel Alpay, Sebastian Hack, Holger Fröning, Vincent Heuveline
Compiler Design Lab, Saarland University, Saarland Informatics Campus, Saarbrücken, Germany
Proceedings of the 2023 International Workshop on OpenCL (IWOCL ’23), 2023

@inproceedings{meyer2023implementation,
   title={Implementation Techniques for SPMD Kernels on CPUs},
   author={Meyer, Joachim and Alpay, Aksel and Hack, Sebastian and Fr{\"o}ning, Holger and Heuveline, Vincent},
   booktitle={Proceedings of the 2023 International Workshop on OpenCL},
   pages={1--12},
   year={2023}
}

More and more frameworks and simulations are developed using heterogeneous programming models such as OpenCL, SYCL, CUDA, or HIP. A significant hurdle to mapping these models to CPUs in a performance-portable manner is that implementing work-group barriers for such kernels requires forward-progress guarantees, so that all work-items can reach the barrier. This work provides guidance for implementing single-program multiple-data (SPMD) programming models such as these on non-SPMD devices, such as CPUs. We discuss the trade-offs of multiple approaches to handling work-group-level barriers, and we present our experience with integrating two known compiler-based approaches for low-overhead work-group synchronization on CPUs. In doing so, we discuss a general design flaw in deep loop fission approaches, as used in the popular Portable Computing Language (PoCL) project, that makes them miscompile certain kernels. For our evaluation, we integrate PoCL’s “loopvec” kernel compiler into hipSYCL and also implement continuation-based synchronization (CBS) in hipSYCL. We compare both against hipSYCL’s library-only fiber implementation on diverse hardware: recent AMD Rome and Intel Icelake server CPUs as well as two Arm server CPUs, namely Fujitsu’s A64FX and Marvell’s ThunderX2. We show that compiler-based approaches outperform library-only implementations by up to multiple orders of magnitude. Further, we adapt our CBS implementation into PoCL and compare it against PoCL’s loopvec approach in both PoCL and hipSYCL. We find that our implementation of CBS, while being more general than PoCL’s approach, gives comparable performance in PoCL and even surpasses it in hipSYCL. We therefore recommend its use in general.
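To illustrate why work-group barriers are the hard part of running SPMD kernels on a CPU, the following is a minimal sketch of the general loop-fission idea: the loop over work-items is split at the barrier so that every work-item finishes the code before the barrier before any work-item runs the code after it. This is not the paper's CBS implementation or PoCL's compiler pass; the kernel, the group size `kGroupSize`, and the function names `run_group_naive` and `run_group_fissioned` are hypothetical and chosen only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical work-group size, chosen only for this illustration.
constexpr std::size_t kGroupSize = 8;
using Group = std::array<int, kGroupSize>;

// SPMD kernel as the programmer writes it, for work-item i:
//   local[i] = in[i];
//   barrier();                        // work-group barrier
//   out[i] = local[(i + 1) % kGroupSize];

// Naive CPU mapping: run the whole kernel for one work-item at a time.
// The barrier is ignored, so a work-item reads its neighbour's slot
// before that neighbour has written it.
Group run_group_naive(const Group& in) {
    Group local{};  // zero-initialised stand-in for local memory
    Group out{};
    for (std::size_t i = 0; i < kGroupSize; ++i) {
        local[i] = in[i];
        out[i] = local[(i + 1) % kGroupSize];
    }
    return out;
}

// Barrier-aware mapping: the work-item loop is split at the barrier.
Group run_group_fissioned(const Group& in) {
    Group local{};
    Group out{};
    // Region before the barrier, executed for all work-items.
    for (std::size_t i = 0; i < kGroupSize; ++i) {
        local[i] = in[i];
    }
    // The end of the first loop plays the role of the barrier.
    // Region after the barrier, executed for all work-items.
    for (std::size_t i = 0; i < kGroupSize; ++i) {
        out[i] = local[(i + 1) % kGroupSize];
    }
    return out;
}
```

With the input `{1, 2, 3, 4, 5, 6, 7, 8}`, the fissioned version returns each work-item's right neighbour, while the naive version mostly returns zeros because the neighbour's slot has not been written yet. This straight-line kernel is the easy case; barriers inside loops or conditionals are where such a simple split no longer suffices.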