Compiler-Driven Performance on Heterogeneous Computing Platforms
Department of Computing Science, University of Alberta
University of Alberta, 2019
@article{chikin2019compiler,
title={Compiler-Driven Performance on Heterogeneous Computing Platforms},
author={Chikin, Artem},
year={2019}
}
Modern parallel programming languages such as OpenMP provide simple, portable programming models that support offloading of computation to various accelerator devices. This, coupled with the increasing prevalence of heterogeneous computing platforms and the battle for supremacy in the co-processor space, gives rise to additional challenges for compiler and runtime vendors, who must handle the increasing complexity and diversity of shared-memory parallel platforms.
To start, this thesis presents three kernel re-structuring ideas that focus on improving the execution of high-level parallel code on GPU devices. The first addresses programs that include multiple parallel blocks within a single region of GPU code. A proposed compiler transformation can split such regions into multiple regions, leading to the launching of multiple kernels, one for each parallel region. The second is a code transformation that sets up a pipeline of kernel execution and asynchronous data transfers. This transformation enables the overlap of communication and computation. The third idea is that the selection of a grid geometry for the execution of a parallel region must balance the GPU occupancy with the potential saturation of the memory throughput in the GPU. Adding this parameter to the geometry selection heuristic can often yield better performance at lower occupancy levels.
This thesis next describes the Iteration Point Difference Analysis, a new static-analysis framework that can be used to determine the memory coalescing characteristics of parallel loops that target GPU offloading and to ascertain the safety and profitability of loop transformations with the goal of improving their memory-access characteristics. GPU kernel execution time across the Polybench suite is improved by up to 25.5x on an Nvidia P100, with overall benchmark improvement of up to 3.2x. An opportunity detected in a SPEC ACCEL benchmark yields a kernel speedup of 86.5x with a benchmark improvement of 3.4x on an Nvidia P100, and a kernel speedup of 111.1x with a benchmark improvement of 2.3x on an Nvidia V100.
The task of modelling performance takes on an ever-increasing importance as systems must make automated decisions on the most suitable offloading target. The third contribution of this thesis motivates this need with a study of cross-architectural changes in the profitability of kernel offloading to GPU versus host CPU execution, and presents a prototype design for a hybrid computing device selection framework.
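To make the memory-coalescing question above concrete, here is a minimal Python sketch. It is not the thesis's Iteration Point Difference Analysis, whose details the abstract does not give. It only illustrates the basic idea of comparing the addresses touched by two adjacent iteration points. The names `AffineAccess`, `thread_stride`, and `is_coalesced` are hypothetical. The sketch assumes a two-deep loop nest over `i` and `j`, an affine flattened array index, and that a stride of at most one element between adjacent GPU threads counts as coalesced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffineAccess:
    """Flattened array index of the form coeff_i * i + coeff_j * j + offset (in elements)."""
    coeff_i: int
    coeff_j: int
    offset: int = 0

    def index(self, i: int, j: int) -> int:
        return self.coeff_i * i + self.coeff_j * j + self.offset


def thread_stride(access: AffineAccess, thread_var: str) -> int:
    """Index difference between two iteration points that differ by 1 in thread_var.

    thread_var names the loop variable assumed to be mapped to adjacent GPU threads.
    """
    if thread_var == "i":
        return access.index(1, 0) - access.index(0, 0)
    if thread_var == "j":
        return access.index(0, 1) - access.index(0, 0)
    raise ValueError(f"unknown loop variable: {thread_var!r}")


def is_coalesced(access: AffineAccess, thread_var: str) -> bool:
    """Assumption: adjacent threads touching the same or neighbouring elements coalesce."""
    return abs(thread_stride(access, thread_var)) <= 1


N = 1024
row_major = AffineAccess(coeff_i=N, coeff_j=1)  # models A[i * N + j]

# Threads mapped to i stride by N elements; threads mapped to j stride by 1.
stride_over_i = thread_stride(row_major, "i")
stride_over_j = thread_stride(row_major, "j")
```

Under these assumptions, mapping threads to `i` gives a stride of `N` elements (uncoalesced), while mapping them to `j` gives unit stride. A loop transformation such as interchange aims to expose that kind of difference when it is safe to do so.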
November 17, 2019 by hgpu