Evaluating the Performance and Portability of OpenCL
Electronic Systems Group, Faculty of Electrical Engineering, Eindhoven University of Technology
Eindhoven University of Technology, 2011
@mastersthesis{van2011evaluating,
  title  = {Evaluating the Performance and Portability of OpenCL},
  author = {van der Sanden, J.},
  school = {Eindhoven University of Technology},
  year   = {2011}
}
Recent developments in processor architecture have brought about a shift from sequential to parallel processing. This shift was not driven by a breakthrough in processor design; rather, it is an alternative design trajectory adopted to avoid the limits reached in single-core development. Along with the shift towards parallel architectures, a gap arose between sequential programmers and parallel architectures. Several industry efforts have tried to bridge this gap, resulting in parallel programming frameworks such as CUDA, OpenMP and the Cell SDK. A more recent parallel programming standard is OpenCL, maintained by the Khronos Group. OpenCL distinguishes itself by offering the programmer a single, flexible programming framework that can target multiple platforms from different vendors. To what extent OpenCL is a suitable substitute for current programming standards is the main topic of this thesis. The thesis includes a detailed comparison and analysis of the performance of several image-processing algorithms implemented in both CUDA and OpenCL and mapped onto an NVIDIA GPU. Despite the similarity of OpenCL and CUDA, performance differences of up to 16% are observed. Furthermore, the suitability of OpenCL as a single standard for targeting multiple platforms is investigated by mapping and optimizing the image-processing algorithms to other architectures, including an AMD GPU and an Intel multi-core CPU. These cross-platform OpenCL mappings show that neither functional portability nor performance portability can always be guaranteed. A method is proposed to improve performance portability by developing a single OpenCL implementation that ports to multiple target devices, reaching at least 80% of the performance of the optimal implementation on each target device.
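The cross-vendor targeting that the abstract refers to rests on OpenCL's host API, which lets a single program discover whatever platforms and devices (NVIDIA or AMD GPUs, Intel CPUs, and so on) are installed at run time. The following minimal sketch is not taken from the thesis; it only illustrates this discovery step using the standard OpenCL 1.x host calls, with fixed-size buffers chosen for brevity.

```c
/* Illustrative sketch (not from the thesis): enumerate all OpenCL platforms
 * and devices visible on the host, showing how one OpenCL program can target
 * hardware from different vendors. */
#include <stdio.h>
#include <CL/cl.h>   /* <OpenCL/opencl.h> on macOS */

int main(void) {
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        char pname[128];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof pname, pname, NULL);
        printf("Platform: %s\n", pname);

        cl_device_id devices[8];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

        for (cl_uint d = 0; d < num_devices; ++d) {
            char dname[128];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof dname, dname, NULL);
            printf("  Device: %s\n", dname);
        }
    }
    return 0;
}
```

Because the same kernels can in principle be built and launched on any device found this way, functional and performance portability across those devices becomes the central question the thesis investigates.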
October 18, 2011 by hgpu