Design Space Exploration of OpenCL Applications on Heterogeneous Parallel Platforms
Politecnico Di Milano, Dept. of Electronics, Information and Bioengineering
Politecnico Di Milano, 2014
@phdthesis{paone2014design,
title={Design space exploration of {OpenCL} applications on heterogeneous parallel platforms},
author={Paone, Edoardo},
year={2014},
school={Politecnico di Milano},
address={Italy}
}
Parallel programming is a skill that software engineers can no longer do without, since multi- and many-core architectures have been widely adopted for general-purpose computing platforms. In 2006 Intel introduced the first multi-core processor on the consumer market and, at the same time, NVIDIA unveiled CUDA, a programming paradigm to exploit Graphics Processing Units (GPUs) for general-purpose computing. Some years later (2008) the Khronos Group released the first specification of OpenCL, an open cross-platform API, inspired by CUDA, to efficiently exploit data-level parallelism while enabling application portability across different computing platforms. This API ensures functional portability of applications, so the same code can be compiled and executed on multi-core CPUs and GPGPUs, and also synthesized and deployed on FPGAs. However, additional fine-tuning of the application code might be needed to get the most performance out of a specific architecture. Another important aspect of application optimization is the exploitation of architectural heterogeneity on modern computing platforms. The convergence toward globally heterogeneous, locally homogeneous parallel architectures, in both the embedded and High Performance Computing (HPC) domains, is causing the challenges of efficiently exploiting the available computing devices to overlap rapidly across the two domains. In particular, the mapping of application tasks cannot be considered independently of the specific optimization of each task, and it must also account for the overhead of data transfers and synchronization between different devices. Code customization and task mapping represent the main challenges for the design and optimization of OpenCL applications targeting heterogeneous parallel platforms. To this end, the Design Space Exploration (DSE) methodology presented in this thesis enables efficient exploration of the customization options of a parametric OpenCL application design.
On the one hand, the proposed techniques reduce the exploration time on simulation platforms while providing close-to-optimal solutions; on the other hand, they exploit platform-specific constraints to prune infeasible solutions from the design space. Task-level parallelism can be combined with request-level parallelism in order to deploy and run multiple independent OpenCL applications on the same platform. This application scenario is enabled by the increasing number of cores integrated on the same chip; however, modern platforms still lack run-time management to support applications that share resources. Thus, multi-application use cases often suffer from resource contention and performance degradation, especially under dynamic workload variations. The contribution of this thesis consists of DSE support for generating Pareto-optimal application configurations with different trade-offs between performance and QoS. A run-time management technique is presented that exploits the knowledge base gathered at design time to implement effective application auto-tuning. By including platform metrics and resource utilization in the optimization phase, this methodology also supports performance-aware scheduling on multi-core platforms and improves overall system performance with respect to soft real-time constraints. The proposed design methodology and run-time software layer have been implemented and demonstrated on a real case study, an OpenCL stereo-matching application, targeting different industrial platforms.
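
As a rough illustration of the ideas in the abstract above (constraint-based pruning of a parametric design space followed by selection of Pareto-optimal configurations), a minimal Python sketch might look as follows. It is not the thesis's method: the parameter names, the work-group size limit, and the cost model in `evaluate` are hypothetical placeholders, and the search is a plain exhaustive enumeration rather than the efficient exploration techniques the thesis proposes.

```python
from itertools import product

# Hypothetical design space; parameter names and values are illustrative only.
DESIGN_SPACE = {
    "work_group_size": [16, 32, 64, 128, 256],
    "compute_units": [1, 2, 4, 8],
}

# Assumed platform-specific constraint (not taken from the thesis).
MAX_WORK_GROUP_SIZE = 128


def enumerate_configs(space):
    """Yield every configuration as a dict mapping parameter name to value."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def is_feasible(cfg):
    """Reject configurations that violate the assumed platform constraint."""
    return cfg["work_group_size"] <= MAX_WORK_GROUP_SIZE


def evaluate(cfg):
    """Toy cost model standing in for simulation or profiling.

    Returns (execution_time, compute_units_used); both are to be minimized.
    """
    effective_wg = min(cfg["work_group_size"], 64)  # assumed saturation point
    time = 1024.0 / (cfg["compute_units"] * effective_wg)
    return (time, cfg["compute_units"])


def dominates(a, b):
    """True if a is no worse than b in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(evaluated):
    """Keep the (config, objectives) pairs not dominated by any other pair."""
    return [
        p for p in evaluated
        if not any(dominates(q[1], p[1]) for q in evaluated)
    ]


def explore(space=DESIGN_SPACE):
    """Enumerate, prune infeasible points, evaluate, and return the front."""
    evaluated = [
        (cfg, evaluate(cfg))
        for cfg in enumerate_configs(space)
        if is_feasible(cfg)
    ]
    return pareto_front(evaluated)


if __name__ == "__main__":
    for cfg, objectives in explore():
        print(cfg, objectives)
```

In a real flow, `evaluate` would be replaced by measurements from a simulator or the target platform, and the resulting front could serve as the design-time knowledge base that a run-time manager selects from.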
January 30, 2015 by hgpu