Exploiting Data Parallelism in GPUs

Yongpeng Zhang
North Carolina State University
North Carolina State University, 2012

   title={Exploiting Data Parallelism in GPUs.},

   author={Zhang, Y.},



Download Download (PDF)   View View   Source Source   



Mainstream microprocessor design no longer delivers performance boosts by increasing the processor clock frequency due to power and thermal constraints. Nonetheless, advances in semiconductor fabrication still allow the transistor density to increase at the rate of Moore’s law. This has resulted in the proliferation of many-core parallel architectures and accelerators, among which GPUs (graphics processing unit) quickly established themselves as suitable for applications that exploit fine-grained data-parallelism. GPU clusters are starting to make inroads into the HPC (high performance computing) domain as well, due to much better power per Flop (floating point operation) performance than general-purpose processors such as CPUs. Even though it is easier to program GPUs than ever, efficiently taking advantage of GPU resources requires unique techniques that are not found elsewhere. The traditional function level task-parallelism can hardly provide enough optimization opportunities for such parallel architectures. Instead, it is crucial to extract data-parallelism and map it to the massive threading execution model advocated by GPUs. This dissertation consists of multiple efforts to build programming models above existing models (CUDA) for single GPUs as well as GPU clusters. We start from manually implementing a flocking-based document clustering algorithm on GPU clusters. With this first-hand experience to write code directly above CUDA and MPI (message passing interface), we make several key observations: (1) Unified memory interface greatly enhances programmability, especially in GPU cluster environment, (2) explicit expression of data parallelism at language level facilitates the mapping of algorithms to massively parallel architectures and (3) auto-tuning is necessary to achieve competitive performance as the parallel architecture becomes more complex. Based on these observations, we propose several programming models and compiler approaches to achieve portability and programmability while retaining as much performance as possible. We propose GStream, a general-purpose, scalable data streaming framework on GPUs. We project powerful, yet concise language abstractions onto GPUs to fully exploit their inherent massive data-parallelism. We take a domain specific language approach to provide an efficient implementation of 3D iterative stencil computations on GPUs with auto-tuning capabilities. We propose CuNesl, a compiler framework to translate and optimize a nested data-parallel language called NESL into parallel CUDA programs for SIMT architectures. By converting recursive calls into while loops, we ensure that the hierarchical execution model in GPUs can be exploited on the "flattened" code. Finally, we design HiDP, a hierarchical data-parallel language suitable for hierarchical features of microprocessor architectures. We then develop a source-to-source compiler that converts HiDP into CUDA C++ source code with tuning capability. It greatly improves coding productivity while still keeping up with the performance of hand-coded CUDA code. The methods above cover a wide range of techniques for GPGPU computing and represent the current technology trend to exploit data parallelism in state-of-the-art GPUs.
VN:F [1.9.22_1171]
Rating: 5.0/5 (1 vote cast)
Exploiting Data Parallelism in GPUs, 5.0 out of 5 based on 1 rating

* * *

* * *

Follow us on Twitter

HGPU group

1496 peoples are following HGPU @twitter

Like us on Facebook

HGPU group

255 people like HGPU on Facebook

* * *

Free GPU computing nodes at hgpu.org

Registered users can now run their OpenCL application at hgpu.org. We provide 1 minute of computer time per each run on two nodes with two AMD and one nVidia graphics processing units, correspondingly. There are no restrictions on the number of starts.

The platforms are

Node 1
  • GPU device 0: nVidia GeForce GTX 560 Ti 2GB, 822MHz
  • GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
  • CPU: AMD Phenom II X6 @ 2.8GHz 1055T
  • RAM: 12GB
  • OS: OpenSUSE 13.1
  • SDK: nVidia CUDA Toolkit 6.5.14, AMD APP SDK 3.0
Node 2
  • GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
  • GPU device 1: AMD/ATI Radeon HD 5870 2GB, 850MHz
  • CPU: Intel Core i7-2600 @ 3.4GHz
  • RAM: 16GB
  • OS: OpenSUSE 12.3
  • SDK: AMD APP SDK 3.0

Completed OpenCL project should be uploaded via User dashboard (see instructions and example there), compilation and execution terminal output logs will be provided to the user.

The information send to hgpu.org will be treated according to our Privacy Policy

HGPU group © 2010-2015 hgpu.org

All rights belong to the respective authors

Contact us: