Exploiting Data Parallelism in GPUs
North Carolina State University
North Carolina State University, 2012
@phdthesis{zhang2012exploiting,
title={Exploiting Data Parallelism in GPUs},
author={Zhang, Y.},
school={North Carolina State University},
year={2012}
}
Mainstream microprocessor design no longer delivers performance gains by raising the processor clock frequency, due to power and thermal constraints. Nonetheless, advances in semiconductor fabrication still allow transistor density to increase at the rate of Moore's law. The result has been a proliferation of many-core parallel architectures and accelerators, among which GPUs (graphics processing units) quickly established themselves as well suited to applications that exploit fine-grained data parallelism. GPU clusters are making inroads into the HPC (high-performance computing) domain as well, owing to much better performance per watt than general-purpose processors such as CPUs. Even though GPUs are easier to program than ever, using their resources efficiently requires techniques not found elsewhere. Traditional function-level task parallelism can hardly provide enough optimization opportunities for such architectures; instead, it is crucial to extract data parallelism and map it onto the massive threading execution model that GPUs advocate.

This dissertation consists of multiple efforts to build programming models above existing models (CUDA) for single GPUs as well as GPU clusters. We start by manually implementing a flocking-based document clustering algorithm on GPU clusters. From this first-hand experience writing code directly on top of CUDA and MPI (message passing interface), we make several key observations: (1) a unified memory interface greatly enhances programmability, especially in a GPU cluster environment; (2) explicit expression of data parallelism at the language level facilitates mapping algorithms onto massively parallel architectures; and (3) auto-tuning is necessary to achieve competitive performance as the parallel architecture grows more complex.

Based on these observations, we propose several programming models and compiler approaches that achieve portability and programmability while retaining as much performance as possible. We propose GStream, a general-purpose, scalable data streaming framework on GPUs that projects powerful yet concise language abstractions onto GPUs to fully exploit their inherent massive data parallelism. We take a domain-specific language approach to provide an efficient implementation of 3D iterative stencil computations on GPUs with auto-tuning capabilities. We propose CuNesl, a compiler framework that translates and optimizes NESL, a nested data-parallel language, into parallel CUDA programs for SIMT architectures; by converting recursive calls into while loops, we ensure that the hierarchical execution model of GPUs can be exploited on the "flattened" code. Finally, we design HiDP, a hierarchical data-parallel language that matches the hierarchical features of modern microprocessor architectures, and develop a source-to-source compiler that converts HiDP into tunable CUDA C++ source code, greatly improving coding productivity while keeping up with the performance of hand-coded CUDA. Together, these methods cover a wide range of techniques for GPGPU computing and represent the current trend toward exploiting data parallelism on state-of-the-art GPUs.
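The mapping the abstract describes, one thread per data element, is worth seeing concretely. Below is a minimal sketch of our own (not taken from the dissertation) of a data-parallel SAXPY in CUDA; it uses CUDA's later managed-memory API as a stand-in for the kind of unified memory interface that observation (1) argues for.

```cuda
// Minimal sketch: one CUDA thread computes one output element.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    // Managed memory (CUDA 6+) illustrates the "unified memory interface"
    // idea; the 2012 dissertation predates it and built its own abstraction.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;       // cover all n elements
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                    // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```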
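The 3D iterative stencil work is easiest to picture with a concrete kernel. The following is a hedged sketch of one 7-point Jacobi-style stencil step, the kind of CUDA code such a DSL might emit; the coefficients are arbitrary placeholders, and the thread-block shape is exactly the sort of parameter an auto-tuner would search over.

```cuda
// Sketch of a 7-point 3D stencil step: each thread owns one (x, y)
// column and marches along z; coefficients are placeholders.
__global__ void stencil7(const float *in, float *out,
                         int nx, int ny, int nz) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= nx - 1 || y < 1 || y >= ny - 1) return;

    for (int z = 1; z < nz - 1; ++z) {
        int c = (z * ny + y) * nx + x;                      // center index
        out[c] = 0.5f * in[c]
               + 0.0833f * (in[c - 1]       + in[c + 1]       // x neighbors
                          + in[c - nx]      + in[c + nx]      // y neighbors
                          + in[c - nx * ny] + in[c + nx * ny]); // z neighbors
    }
}
```

A launch such as `stencil7<<<dim3((nx + 7) / 8, (ny + 7) / 8), dim3(8, 8)>>>(in, out, nx, ny, nz)` assigns each thread a z-column; an auto-tuner would vary the block shape and the z-marching strategy per GPU, which is the tuning space the abstract alludes to.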
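Finally, the recursion-to-loop rewrite behind CuNesl can be illustrated on a small case. The device function below is a hypothetical hand-written analogue, not CuNesl output (the name `tree_sum_iterative` and the `Node` layout are invented): a recursive tree sum becomes a while loop over an explicit stack, a form that SIMT hardware executes far more predictably than deep recursion.

```cuda
// Hypothetical illustration of converting recursion into a while loop.
struct Node {
    float value;
    int left, right;          // child indices into the node array, -1 = none
};

__device__ float tree_sum_iterative(const Node *nodes, int root) {
    int stack[64];            // explicit stack; assumes bounded tree depth
    int top = 0;
    stack[top++] = root;
    float sum = 0.0f;

    while (top > 0) {         // the loop replaces recursive calls
        int i = stack[--top];
        if (i < 0) continue;  // skip absent children
        sum += nodes[i].value;
        stack[top++] = nodes[i].left;   // "recursive calls" become pushes
        stack[top++] = nodes[i].right;
    }
    return sum;
}
```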
October 14, 2012 by hgpu