Program Optimization of Stencil Based Application on the GPU-Accelerated System

Guibin Wang, Xuejun Yang, Ying Zhang, Tao Tang, XuDong Fang

Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China

IEEE International Symposium on Parallel and Distributed Processing with Applications, 2009

DOI:10.1109/ISPA.2009.70

Graphics processing units (GPUs), with their many lightweight data-parallel cores, can provide substantial parallel computational power to accelerate general-purpose applications. However, this computing capacity cannot be fully utilized by memory-intensive applications, which are limited by off-chip memory bandwidth and latency. Stencil computation has abundant parallelism and low computational intensity, which makes it a useful benchmark for architectural evaluation. In this paper, we propose several memory optimizations for mgrid, a stencil-based application from the SPEC2K benchmarks. By exploiting data locality in the three-level memory hierarchy and tuning the thread granularity, we reduce the pressure on off-chip memory bandwidth. To hide the long off-chip memory access latency, we further prefetch data during computation using double buffering. To fully exploit the CPU-GPU heterogeneous system, we redistribute the computation between these two computing resources. With all these optimizations, we obtain a 24.2x speedup over the simple mapping version and up to a 34.3x speedup over a CPU implementation.
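To make the tiling and double-buffering ideas in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a 1D 3-point averaging stencil instead of mgrid's 3D stencils, runs sequentially on the CPU, and uses hypothetical names (`load_tile`, `stencil_tiled_double_buffered`). The "prefetch" here only shows the buffer rotation; on a GPU the load of the next tile would overlap with computation on the current one.

```python
def stencil_reference(src):
    """Direct 3-point average; the two boundary cells are copied unchanged."""
    out = list(src)
    for i in range(1, len(src) - 1):
        out[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return out


def load_tile(src, start, tile):
    """Copy one tile plus a one-cell halo on each side.

    Stands in for a load from off-chip memory into a fast local buffer.
    Returns the global index of the first copied cell and the copy itself.
    """
    lo = max(start - 1, 0)
    hi = min(start + tile + 1, len(src))
    return lo, src[lo:hi]


def stencil_tiled_double_buffered(src, tile=4):
    """Compute the same stencil tile by tile, rotating between two buffers."""
    n = len(src)
    out = list(src)
    starts = list(range(0, n, tile))
    if not starts:
        return out

    buffers = [None, None]
    buffers[0] = load_tile(src, starts[0], tile)  # fill the first buffer

    for k, start in enumerate(starts):
        cur = k % 2
        # Fetch the next tile into the other buffer before computing.
        if k + 1 < len(starts):
            buffers[1 - cur] = load_tile(src, starts[k + 1], tile)

        lo, local = buffers[cur]
        # Every neighbor read below hits the local copy, not src.
        for i in range(max(start, 1), min(start + tile, n - 1)):
            j = i - lo
            out[i] = (local[j - 1] + local[j] + local[j + 1]) / 3.0
    return out
```

Each interior cell is read three times by the stencil but loaded from `src` only once per tile (plus the halo cells shared with neighboring tiles), which is the kind of data reuse that reduces off-chip traffic. The tile size plays a role loosely analogous to thread granularity: larger tiles amortize the halo over more cells but need more local storage.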

Tags: Computer science, Optimization