Automatic Code Generation for Stencil Computations on GPU Architectures

Justin Andrew Holewinski
The Ohio State University, 2012

@phdthesis{holewinski2012automatic,
   title={Automatic Code Generation for Stencil Computations on GPU Architectures},
   author={Holewinski, J.A.},
   school={Ohio State University},
   year={2012}
}




The development of parallel architectures is now nearly ubiquitous not only in the high-performance computing field, but also in the commodity electronics market. Even embedded processors found in cell phones and tablet computers are starting to incorporate parallel architectures. These architectures exploit both SIMD (Single-Instruction Multiple-Data) and SIMT (Single-Instruction Multiple-Thread) parallelism to achieve higher levels of performance than were previously possible. Additionally, multiprocessors are becoming increasingly heterogeneous by incorporating different architectures into the same die, such as NVIDIA’s Tegra and AMD’s Fusion APU. As the computer hardware industry moves to increasingly parallel and heterogeneous architectures, the computer software industry has been forced to make drastic changes to the way software is designed and developed. This has placed an increasing burden on software developers not only to write bug-free software, but also to scale software performance across these diverse architectures.

Multi-processors increasingly expose SIMD parallelism as a way to improve the per-core performance of the hardware without requiring significant clock-speed increases, and vector instruction sets are adopting larger vector widths to increase the available parallelism. Intel’s SSE instruction set on x86 uses 128-bit vectors and instructions that can operate on 4 single-precision or 2 double-precision floating-point numbers at a time. More recently, the AVX instruction set on x86 extends this to 256-bit vectors and instructions that can operate on 8 single-precision or 4 double-precision floating-point numbers at a time. Exploiting the SIMD parallelism available in modern hardware can be a difficult process for software developers.
Vector instruction sets often impose limitations, such as alignment restrictions on vector loads/stores and a lack of scatter/gather operations, that make it a non-trivial process to convert scalar code into higher-performance vector code. In the first part of this dissertation, we present a method for automatically finding sections of scalar application code that would likely benefit from SIMD vectorization.

Many-core architectures such as those found in Graphics Processing Units (GPUs) have become good targets for high-performance applications such as those found in scientific computing. Modern high-end GPUs have a theoretical floating-point throughput of over 2 TFlop/s, making them attractive platforms for scientific computing. Programming environments such as NVIDIA’s CUDA, Khronos’ OpenCL, and Microsoft’s DirectCompute allow application developers to write software that executes directly on GPU hardware, but their abstractions are very close to the actual hardware and are complex for developers to use. To take advantage of GPU devices, application developers must first determine which parts of their applications would benefit from GPU acceleration, then port those parts to CUDA, OpenCL, or DirectCompute. After porting their code, significant optimization effort is often required to maximize the performance of the GPU code, separate from any previous effort spent optimizing the CPU versions. This makes GPU programming very complex for computational scientists and other software writers who do not have a background in computer architecture and GPU programming models.

In the second part of this dissertation, we present an automatic code generation framework for stencil computations on GPU devices. Stencil computations are important parts of many scientific applications, including PDE solvers, grid-based simulations, and image processing.
Our code generation framework takes a high-level description of the stencil problem and generates high-performance code for a variety of GPU architectures. The performance of GPU programs is often highly dependent on the choice of thread block and tile size, and the optimal choice can differ based on the characteristics of the program. The code generation framework for stencil programs proposed in this work is therefore parameterized on the choice of block and tile size, a choice that can have a large impact on performance. In the third part of this dissertation, we explore the effects of this choice on the performance of stencil programs generated by our code generation framework, and we propose a performance model that uses a description of the stencil program and the target GPU hardware to automatically select an optimal block and tile size.
