Tuned and asynchronous stencil kernels for CPU/GPU systems (thesis)

Sundaresan Venkatasubramanian
Georgia Institute of Technology
Georgia Institute of Technology, 2009


   title={Tuned and asynchronous stencil kernels for CPU/GPU systems},

   author={Venkatasubramanian, S.},


   publisher={Georgia Institute of Technology}


Download Download (PDF)   View View   Source Source   



We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi’s iterative method for the 2-D Poisson equation on a structured grid, in both single- and double-precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on a NVIDIA C1060. Motivated to find a still faster implementation, we further consider “wildly asynchronous” implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on the principle of a chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations, thereby potentially trading off more flops (via more iterations to converge) for a higher degree of asynchronous parallelism. Our relaxed-synchronization implementations on a GPU can be 1.2-2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly “fast-and-loose” algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs.
No votes yet.
Please wait...

* * *

* * *

* * *

HGPU group © 2010-2022 hgpu.org

All rights belong to the respective authors

Contact us: