Performance Portability in Accelerated Parallel Kernels
University of Illinois at Urbana-Champaign, Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign, IMPACT Technical Report IMPACT-13-01, 2013
@techreport{stratton:12:parboil2.5,
  author      = {Stratton, John A. and Kim, Hee-Seok and Jablin, Thomas B. and Hwu, Wen-Mei W.},
  title       = {Performance Portability in Accelerated Parallel Kernels},
  institution = {University of Illinois at Urbana-Champaign},
  number      = {IMPACT-13-01},
  address     = {Urbana},
  year        = {2013},
  month       = may
}
Heterogeneous architectures, by definition, include multiple processing components with very different microarchitectures and execution models. In particular, computing platforms from supercomputers to smartphones can now incorporate both CPU and GPU processors. Disparities between CPU and GPU processor architectures have naturally led to distinct programming models and development patterns for each component. Developers for a specific system decompose their application, assign different parts to different heterogeneous components, and express each part in its assigned component’s native model. But without additional effort, that application will not be suitable for another architecture with a different heterogeneous component balance. Developers addressing a variety of platforms must either write multiple implementations for every potential heterogeneous component or fall back to a "safe" CPU implementation, incurring a high development cost or loss of system performance, respectively. The disadvantages of developing for heterogeneous systems are vastly reduced if one source code implementation can be mapped to either a CPU or GPU architecture with high performance. A convention has emerged from the OpenCL community defining how to write kernels for performance portability among different GPU architectures. This paper demonstrates that OpenCL programs written according to this convention contain enough abstract performance information to enable effective translations to CPU architectures as well. The challenge is that an OpenCL implementation must focus on those programming conventions more than on the most natural mapping of the language specification to the target architecture. In particular, prior work implementing OpenCL on CPU platforms neglects the OpenCL kernel’s implicit expression of performance properties such as spatial or temporal locality. We outline some concrete transformations that can be applied to an OpenCL kernel to suitably map the abstract performance properties to CPU execution constructs. We show that such transformations result in marked performance improvements over existing CPU OpenCL implementations for GPU-portable OpenCL kernels. Ultimately, we show that the performance of GPU-portable OpenCL kernels, when using our methodology, is comparable to the performance of native multicore CPU programming models such as OpenMP.
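To make the idea concrete, the following is a minimal sketch, not taken from the report, of a GPU-portable OpenCL kernel and of one common locality-preserving CPU mapping: serializing the work-items of a work-group into a contiguous loop so that accesses that would coalesce on a GPU become sequential, cache-friendly accesses on a CPU core. The kernel name scale, the helper scale_workgroup, and the constant WG_SIZE are illustrative assumptions, and the loop shown is one representative transformation rather than the exact set of transformations outlined in the report.

/* Hypothetical GPU-portable OpenCL kernel: adjacent work-items touch
   adjacent array elements, so spatial locality is expressed implicitly
   through get_global_id(0). */
__kernel void scale(__global const float *in,
                    __global float *out,
                    float alpha,
                    int n)
{
    int i = get_global_id(0);
    if (i < n)
        out[i] = alpha * in[i];
}

/* Sketch of a locality-aware CPU mapping (assumed names): one OS thread
   executes an entire work-group by iterating over its work-items, so the
   kernel's per-work-item accesses become a sequential sweep over memory. */
#define WG_SIZE 256

static void scale_workgroup(const float *in, float *out,
                            float alpha, int n, int group_id)
{
    for (int lid = 0; lid < WG_SIZE; ++lid) {      /* serialized work-items  */
        int i = group_id * WG_SIZE + lid;          /* same index math as kernel */
        if (i < n)
            out[i] = alpha * in[i];
    }
}

A mapping that instead spawned a thread or task per work-item would discard this implicit locality information, which is the kind of shortfall in prior CPU OpenCL implementations that the abstract points to.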