Program Optimization of Stencil Based Application on the GPU-Accelerated System


Guibin Wang, Xuejun Yang, Ying Zhang, Tao Tang, XuDong Fang

National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, China

IEEE International Symposium on Parallel and Distributed Processing with Applications, 2009

DOI:10.1109/ISPA.2009.70


The Graphics Processing Unit (GPU), with its many lightweight data-parallel cores, can provide substantial parallel computational power to accelerate general-purpose applications. However, this computing capacity cannot be fully utilized by memory-intensive applications, which are limited by off-chip memory bandwidth and latency. Stencil computation has abundant parallelism and low computational intensity, which makes it a useful benchmark for architectural evaluation. In this paper, we propose several memory optimizations for mgrid, a stencil-based application from the SPEC2K benchmarks. By exploiting data locality in the three-level memory hierarchy and tuning the thread granularity, we reduce the pressure on off-chip memory bandwidth. To hide the long off-chip memory access latency, we further prefetch data during computation through double buffering. To fully exploit the CPU-GPU heterogeneous system, we redistribute the computation between these two computing resources. With all these optimizations, we gain a 24.2x speedup over the simple mapping version, and as much as a 34.3x speedup over a CPU implementation.
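The locality idea in the abstract can be illustrated with a small CPU-side sketch. It is not the paper's code: it assumes a generic 7-point averaging stencil rather than the actual mgrid kernels, and the names `stencil_naive`, `stencil_tiled`, and `tile` are hypothetical. The tiled sweep visits the grid block by block, which is roughly the access pattern a GPU thread block would follow when staging a tile in on-chip memory, and it produces the same result as the plain sweep.

```python
# Illustrative sketch only: a generic 7-point 3D stencil, NOT the mgrid kernels.
# All function names and the tile size are assumptions made for this example.

def idx(i, j, k, n):
    """Flatten (i, j, k) into an index of an n*n*n row-major grid."""
    return (i * n + j) * n + k

def point(src, i, j, k, n):
    """Average a cell with its six face neighbours (7 reads per output)."""
    return (src[idx(i, j, k, n)]
            + src[idx(i - 1, j, k, n)] + src[idx(i + 1, j, k, n)]
            + src[idx(i, j - 1, k, n)] + src[idx(i, j + 1, k, n)]
            + src[idx(i, j, k - 1, n)] + src[idx(i, j, k + 1, n)]) / 7.0

def stencil_naive(src, n):
    """One sweep over the interior in plain i, j, k order."""
    dst = list(src)  # boundary cells are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                dst[idx(i, j, k, n)] = point(src, i, j, k, n)
    return dst

def stencil_tiled(src, n, tile=4):
    """Same sweep, visiting the interior in tile*tile*tile blocks.

    Neighbouring outputs inside a block share most of their inputs, so a
    block that fits in fast memory re-reads far less from slow memory.
    """
    dst = list(src)
    hi = n - 1  # exclusive upper bound of the interior
    for i0 in range(1, hi, tile):
        for j0 in range(1, hi, tile):
            for k0 in range(1, hi, tile):
                for i in range(i0, min(i0 + tile, hi)):
                    for j in range(j0, min(j0 + tile, hi)):
                        for k in range(k0, min(k0 + tile, hi)):
                            dst[idx(i, j, k, n)] = point(src, i, j, k, n)
    return dst
```

Each output here costs seven reads and a handful of arithmetic operations, which is the low computational intensity the abstract refers to; blocking changes only the order of the work, not the result.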

Tags: Computer science, Optimization

April 7, 2011 by hgpu
