Program Optimization of Stencil Based Application on the GPU-Accelerated System

Guibin Wang, Xuejun Yang, Ying Zhang, Tao Tang, XuDong Fang
Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
IEEE International Symposium on Parallel and Distributed Processing with Applications, 2009


   title={Program optimization of stencil based application on the gpu-accelerated system},

   author={Wang, G. and Yang, X. and Zhang, Y. and Tang, T. and Fang, X.D.},

   booktitle={2009 IEEE International Symposium on Parallel and Distributed Processing with Applications},





Source Source   



Graphic Processing Unit (GPU), with many light-weight data-parallel cores, can provide substantial parallel computational power to accelerate general purpose applications. But the powerful computing capacity could not be fully utilized for memory-intensive applications, which are limited by off-chip memory bandwidth and latency. Stencil computation has abundant parallelism and low computational intensity which make it a useful architectural evaluation benchmark. In this paper, we propose some memory optimizations for a stencil based application mgrid from SPEC 2 K benchmarks. Through exploiting data locality in 3-level memory hierarchies and tuning the thread granularity, we reduce the pressure on the off-chip memory bandwidth. To hide the long off-chip memory access latency, we further prefetch data during computation through double-buffer. In order to fully exploit the CPU-GPU heterogeneous system, we redistribute the computation between these two computing resource. Through all these optimizations, we gain 24.2 x speedup compared to the simple mapping version, and get as high as 34.3 x speedup when compared with a CPU implementation.
No votes yet.
Please wait...

* * *

* * *

HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors

Contact us: