Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs

Jiayuan Meng, Kevin Skadron
Department of Computer Science, University of Virginia
In ICS ’09: Proceedings of the 23rd international conference on Supercomputing (2009), pp. 256-265.

Iterative stencil loops (ISLs) are used in many applications, and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the optimal ghost zone size depends on the characteristics of both the architecture and the application, and its selection has previously been studied only for message-passing systems in a grid environment. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the best-performing ghost zone size and generate the appropriate code. The model is validated on four diverse ISL applications, for which the predicted ghost zone configurations achieve at least 98% of the optimal speedup.
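The trade-off described above (a wider ghost zone means fewer global synchronizations but more redundant computation) can be illustrated with a small sequential emulation. This is a hypothetical sketch, not the authors' framework or their Tesla performance model. It assumes a 1-D three-point averaging stencil with fixed end cells. Each tile loads its interior plus a ghost zone of `depth` cells per side, runs `depth` iterations locally (the valid region shrinks by one cell per side per iteration), then writes back only its interior before the next global synchronization. The function names `naive`, `tiled`, and `pick_depth` and the linear cost model with `sync_cost` and `cell_cost` are illustrative assumptions; a real GPU cost model would be far more detailed.

```python
def stencil_step(a):
    """One Jacobi-style sweep of a 3-point average; the two end cells stay fixed."""
    out = list(a)
    for i in range(1, len(a) - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out


def naive(a, iters):
    """Reference result: one global synchronization per iteration, no redundancy."""
    for _ in range(iters):
        a = stencil_step(a)
    return a


def tiled(a, iters, tile, depth):
    """Hypothetical ghost-zone tiling: each tile runs `depth` iterations between syncs.

    Returns (result, number_of_global_syncs, total_cell_updates). Updates made
    inside the ghost zone are redundant work that another tile also performs.
    """
    n = len(a)
    a = list(a)
    syncs = updates = done = 0
    while done < iters:
        k = min(depth, iters - done)  # iterations until the next global sync
        new = list(a)
        for lo in range(0, n, tile):
            hi = min(lo + tile, n)
            # Load the tile plus a ghost zone of k cells per side (clipped at the array ends).
            start, stop = max(0, lo - k), min(n, hi + k)
            buf = a[start:stop]
            for _ in range(k):
                nxt = list(buf)
                # Local buffer edges are never updated: at a global boundary they are
                # the fixed cells; elsewhere they go stale, and the error creeps inward
                # one cell per iteration, never reaching the tile interior within k steps.
                for j in range(1, len(buf) - 1):
                    nxt[j] = (buf[j - 1] + buf[j] + buf[j + 1]) / 3.0
                    updates += 1
                buf = nxt
            new[lo:hi] = buf[lo - start:hi - start]  # write back only the interior
        a = new
        syncs += 1
        done += k
    return a, syncs, updates


def pick_depth(a, iters, tile, sync_cost, cell_cost, candidates):
    """Toy model (an assumption): time = syncs * sync_cost + updates * cell_cost."""
    best = None
    for d in candidates:
        _, syncs, updates = tiled(a, iters, tile, d)
        cost = syncs * sync_cost + updates * cell_cost
        if best is None or cost < best[1]:
            best = (d, cost)
    return best


if __name__ == "__main__":
    data = [float(i % 7) for i in range(64)]
    ref = naive(data, 8)
    for d in (1, 2, 4, 8):
        out, syncs, updates = tiled(data, 8, tile=16, depth=d)
        assert all(abs(x - y) < 1e-12 for x, y in zip(out, ref))
        print(f"depth={d}: syncs={syncs}, cell updates={updates}")
    print("best (depth, cost):", pick_depth(data, 8, 16, 100.0, 1.0, (1, 2, 4, 8)))
```

With 64 cells, 16-cell tiles, and 8 iterations, doubling the ghost zone depth halves the number of synchronizations while the total number of cell updates grows, so the best depth under this toy model depends on the relative cost of a synchronization versus a cell update. That is the same kind of architecture- and application-dependent choice the abstract says the performance model is meant to automate.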