Improving the Programmability of GPU Architectures
Technische Universiteit Eindhoven, NVIDIA
Cedric Nugteren, 2014
@phdthesis{nugteren2014improving,
title={Improving the Programmability of GPU Architectures},
author={Nugteren, Cedric},
year={2014}
}
Throughout the past decades, the tremendous growth of single-core performance has been the key enabler for digital technology to become ubiquitous in our society. Recently, the breakdown of Dennard scaling has resulted in power dissipation issues, leading to reduced performance growth. Performance growth has since been re-enabled by multi-core processors and by exploiting the energy efficiency of specialised accelerators such as graphics processing units (GPUs). This has led to a heterogeneous and parallel computing environment, making programming a challenging task. Programmers are faced with a variety of new languages and are required to deal with the architecture's parallelism and memory hierarchy. This has become increasingly important, especially considering the memory wall and the prospect of dark silicon. Apart from programming effort, issues such as code maintainability and portability have become of major importance.
To address these issues, this thesis first introduces algorithmic species: a classification of program code based on memory access patterns. Algorithmic species is a structured classification that programmers and compilers can use, for example, to make parallelisation decisions or to perform memory access optimisations. The algorithmic species classification is used in a skeleton-based compiler to automatically generate efficient and readable code for GPUs and other parallel processors. To do so, C code is first automatically annotated with species information. The annotated code is subsequently fed into Bones, a source-to-source compiler that provides pre-optimised code templates ("skeletons") for specific algorithmic species. By applying traditional and species-based optimisations such as thread coarsening and kernel fusion on top of this, Bones is able to generate competitive code.
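To illustrate the idea, consider two C loops with different memory access patterns. The species annotations in the comments follow the `input|pattern -> output|pattern` notation used in the thesis; the function names and the exact annotation spelling here are illustrative, not the compiler's actual pragma syntax.

```c
#define N 8

/* An element-wise loop: each iteration reads one element of `in` and
 * writes one element of `out`, with no cross-iteration dependences.
 * Species (illustrative notation):
 *   in[0:N-1]|element -> out[0:N-1]|element
 * This classifies the loop as a parallel map, so a skeleton-based
 * compiler can emit one GPU thread per element. */
void scale(const int *in, int *out) {
    for (int i = 0; i < N; i++)
        out[i] = 2 * in[i];
}

/* A 3-point stencil: each output element reads a neighbourhood of the
 * input. Species (illustrative notation):
 *   in[0:N-1]|neighbourhood(-1:1) -> out[1:N-2]|element
 * The different input pattern selects a different skeleton, e.g. one
 * that tiles the input into on-chip memory to exploit the overlap. */
void stencil3(const int *in, int *out) {
    for (int i = 1; i < N - 1; i++)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}
```

Both loops are fully parallel, but their species differ, and it is exactly this difference that determines which pre-optimised template the compiler instantiates.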
Combining skeletons with a program-code classification (the species) creates a unique code generation approach, integrating a skeleton-based compiler into an automated compilation flow for the first time. Furthermore, this thesis proposes to change the GPU's thread scheduling mechanism to improve its programmability. Programming models for GPUs allow programmers to specify that threads are independent, removing ordering constraints. Still, GPUs do not exploit the potential for locality (e.g. improved cache performance) that this independence enables: threads are scheduled in a fixed order. This thesis quantifies the potential benefit of scheduling threads in a "locality-aware" manner. To provide better insight into locality and cache behaviour, a detailed reuse-distance based cache model for GPUs is introduced.
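The cache model rests on the classical reuse-distance metric: the number of distinct addresses touched between two consecutive accesses to the same address. Under an LRU cache of C lines, an access hits exactly when its reuse distance is below C, which is what makes the metric a basis for cache modelling. Below is a minimal sequential sketch of the metric itself; it omits the GPU-specific extensions the thesis adds (parallel thread interleaving, latency, and associativity effects), and uses a simple O(n·m) backward scan rather than the tree-based algorithms real tools use.

```c
#define TRACE_LEN 8
#define MAX_ADDR 16   /* addresses in the toy trace are small integers */

/* Reuse distance of the access at `pos` in `trace`: the number of
 * distinct addresses touched since the previous access to the same
 * address, or -1 if this is the first use (a cold miss). */
int reuse_distance(const int *trace, int pos) {
    int addr = trace[pos];
    int seen[MAX_ADDR] = {0};
    int distinct = 0;
    for (int j = pos - 1; j >= 0; j--) {
        if (trace[j] == addr)
            return distinct;      /* previous use found */
        if (!seen[trace[j]]) {    /* count each address only once */
            seen[trace[j]] = 1;
            distinct++;
        }
    }
    return -1;                    /* cold miss: no previous use */
}
```

For the trace `1 2 3 1 2 2 4 1`, the second access to address 2 (position 4) has reuse distance 2 (addresses 3 and 1 intervene), while the third (position 5) has reuse distance 0 and would hit in any non-empty LRU cache. A locality-aware thread schedule effectively shortens such distances by running threads that share data close together in time.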
May 7, 2014 by hgpu