Performance Traps in OpenCL for CPUs

Jie Shen, Jianbin Fang, Henk Sips, Ana Lucia Varbanescu
Parallel and Distributed Systems Group, Delft University of Technology, Delft, The Netherlands
21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP’13), 2013

@inproceedings{shen2013performance,
   author={Jie Shen and Jianbin Fang and Henk Sips and Ana Lucia Varbanescu},
   title={Performance Traps in OpenCL for CPUs},
   booktitle={Proceedings of the 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP’13)},
   year={2013},
   month={February},
   location={Belfast, Northern Ireland, UK},
   url={http://www.pds.ewi.tudelft.nl/fileadmin/pds/homepages/shenjie/papers/Shen_PDP2013.pdf},
   topic={Parallel Programming},
   group={PDS}
}

With its design concept of cross-platform portability, OpenCL can be used not only on GPUs (for which it is quite popular), but also on CPUs. Whether porting GPU programs to CPUs or simply writing new code for CPUs, using OpenCL raises the question of performance, usually in one of two forms: "OpenCL is not performance portable!" or "Why use OpenCL for CPUs at all?!". We argue that both issues can be addressed by a thorough study of the factors that impact the performance of OpenCL on CPUs. This analysis is the focus of this paper. Specifically, starting from the two main architectural mismatches between many-core CPUs and the OpenCL platform (parallelism granularity and the memory model), we identify eight performance "traps" that lead to performance degradation in OpenCL for CPUs. Using multiple code examples, from both synthetic and real-life benchmarks, we quantify the impact of these traps, showing that avoiding them can give up to 10 times better performance. Furthermore, we point out that the solutions we provide for avoiding these traps are simple and generic code transformations, which can be easily adopted by either programmers or automated tools. Therefore, we conclude that a certain degree of OpenCL inter-platform performance portability, while indeed not a given, can be achieved by simple and generic code transformations.
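The abstract does not list the eight traps, so the following is only an illustrative sketch of what a parallelism-granularity transformation can look like. It is not taken from the paper. It is written in plain C that mimics an OpenCL launch with loops, so it can be run without an OpenCL runtime. The function names (`vec_add_item`, `launch_fine`, `launch_coarse`) and the `chunk` parameter are hypothetical. The idea is that a GPU-style launch creates one work-item per element, whereas a CPU-friendly version may give each work-item a contiguous chunk, so that far fewer, heavier work-items are scheduled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an OpenCL kernel body:
 * one work-item adds exactly one element. */
static void vec_add_item(size_t gid, const float *a, const float *b, float *out)
{
    out[gid] = a[gid] + b[gid];
}

/* Fine-grained launch (GPU style): n work-items, one element each.
 * The loop plays the role of the runtime iterating over the NDRange.
 * Returns the number of work-items that were "launched". */
size_t launch_fine(const float *a, const float *b, float *out, size_t n)
{
    size_t items = 0;
    for (size_t gid = 0; gid < n; ++gid) {
        vec_add_item(gid, a, b, out);
        ++items;
    }
    return items;
}

/* Coarsened launch (assumed CPU-friendly variant): each work-item
 * processes a contiguous chunk of `chunk` elements, so only
 * ceil(n / chunk) work-items are needed for the same result. */
size_t launch_coarse(const float *a, const float *b, float *out,
                     size_t n, size_t chunk)
{
    assert(chunk > 0);
    size_t items = 0;
    for (size_t start = 0; start < n; start += chunk) {
        size_t end = (start + chunk < n) ? start + chunk : n;
        for (size_t i = start; i < end; ++i)
            out[i] = a[i] + b[i];   /* inner loop is a candidate for vectorization */
        ++items;
    }
    return items;
}
```

Both functions produce the same output array; only the number of simulated work-items differs. Whether such coarsening actually helps on a given CPU and OpenCL implementation is something to measure, not assume.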