Program Optimization of Stencil Based Application on the GPU-Accelerated System
Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
IEEE International Symposium on Parallel and Distributed Processing with Applications, 2009
@conference{wang2009program,
title={Program optimization of stencil based application on the {GPU}-accelerated system},
author={Wang, G. and Yang, X. and Zhang, Y. and Tang, T. and Fang, X.D.},
booktitle={2009 IEEE International Symposium on Parallel and Distributed Processing with Applications},
pages={219--225},
year={2009},
organization={IEEE}
}
The graphics processing unit (GPU), with its many lightweight data-parallel cores, can provide substantial parallel computational power to accelerate general-purpose applications. However, this computing capacity cannot be fully utilized by memory-intensive applications, which are limited by off-chip memory bandwidth and latency. Stencil computation has abundant parallelism and low computational intensity, which makes it a useful architectural evaluation benchmark. In this paper, we propose several memory optimizations for mgrid, a stencil-based application from the SPEC2K benchmarks. By exploiting data locality in the 3-level memory hierarchy and tuning the thread granularity, we reduce the pressure on off-chip memory bandwidth. To hide the long off-chip memory access latency, we further prefetch data during computation through double buffering. To fully exploit the CPU-GPU heterogeneous system, we redistribute the computation between these two computing resources. With all these optimizations, we gain a 24.2x speedup over the simple mapping version and as much as a 34.3x speedup over a CPU implementation.
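To make the data-locality idea in the abstract concrete, here is a minimal CPU-side sketch in Python, not the paper's GPU code. It assumes a 7-point stencil with made-up coefficients (mgrid's actual operators and the authors' kernels are not given here), and all function names are hypothetical. It compares a plain sweep with a tiled sweep that visits the grid in small blocks, which is the traversal pattern a GPU version would typically map onto thread blocks and on-chip memory.

```python
def make_grid(n):
    """Return an n*n*n grid with a deterministic fill (illustrative data only)."""
    return [[[float((i * 31 + j * 17 + k * 7) % 11) for k in range(n)]
             for j in range(n)] for i in range(n)]


def stencil_point(src, i, j, k, c0=0.5, c1=1.0 / 12.0):
    """7-point stencil: weighted centre plus the six face neighbours.

    The coefficients c0 and c1 are assumptions chosen for illustration.
    """
    return (c0 * src[i][j][k]
            + c1 * (src[i - 1][j][k] + src[i + 1][j][k]
                    + src[i][j - 1][k] + src[i][j + 1][k]
                    + src[i][j][k - 1] + src[i][j][k + 1]))


def stencil_naive(src):
    """Sweep all interior points in plain i, j, k order."""
    n = len(src)
    dst = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                dst[i][j][k] = stencil_point(src, i, j, k)
    return dst


def stencil_tiled(src, tile=4):
    """Sweep the interior tile by tile so neighbouring reads stay close together."""
    n = len(src)
    dst = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for ii in range(1, n - 1, tile):
        for jj in range(1, n - 1, tile):
            for kk in range(1, n - 1, tile):
                # Each tile is clipped at the interior boundary.
                for i in range(ii, min(ii + tile, n - 1)):
                    for j in range(jj, min(jj + tile, n - 1)):
                        for k in range(kk, min(kk + tile, n - 1)):
                            dst[i][j][k] = stencil_point(src, i, j, k)
    return dst


grid = make_grid(10)
# Tiling only reorders the traversal; every point is computed identically.
assert stencil_naive(grid) == stencil_tiled(grid, tile=4)
```

Each output point needs seven reads for a handful of arithmetic operations, which is why the computation is bandwidth-bound; tiling changes only the order of the work, so that values loaded for one point can be reused by its neighbours while they are still in fast memory.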
April 7, 2011 by hgpu