Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping

E. Z. Zhang, Y. Jiang, Z. Guo, X. Shen

Computer Science Department, The College of William and Mary, Williamsburg, VA, USA 23185

Proceedings of the 24th ACM International Conference on Supercomputing, ICS ’10, 2010, Pages: 115-126

DOI:10.1145/1810085.1810104

@conference{zhang2010streamlining,
  title={Streamlining gpu applications on the fly},
  author={Zhang, E.Z. and Jiang, Y. and Guo, Z. and Shen, X.},
  booktitle={Proceedings of the 24th ACM International Conference on Supercomputing},
  pages={115--126},
  year={2010},
  organization={ICS}
}

Because of their tremendous computing power and remarkable cost efficiency, GPUs (graphics processing units) have quickly emerged as an influential platform for high-performance computing. However, because GPUs are designed for massively data-parallel computing, their performance is sensitive to conditional statements in a GPU application. At a conditional branch where threads diverge in which path to take, the threads taking different paths have to run serially. Such divergences often cause serious performance degradation, hindering the adoption of GPUs for many applications that contain non-trivial branches or certain types of loops. This paper presents a systematic investigation into the use of runtime thread-data remapping to solve that problem. It introduces an abstract form of GPU applications, on the basis of which it describes the use of reference redirection and data layout transformation for remapping data and threads to minimize thread divergences. It discusses the major challenges for practical deployment of the remapping techniques, most notably the conflict between the large remapping overhead and the need for the remapping to happen on the fly, since thread divergences depend on runtime values. It addresses this challenge by proposing a CPU-GPU pipelining scheme and a label-assign-move (LAM) algorithm that hide virtually all of the remapping overhead. Finally, it reports significant performance improvements produced by the remapping for a set of GPU applications, demonstrating the potential of the techniques for streamlining GPU applications on the fly.
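The idea of remapping threads to data can be illustrated with a small host-side simulation. The sketch below is not the paper's LAM algorithm or its CPU-GPU pipeline; it is a plain Python model, assuming a warp size of 32 and a hypothetical two-way branch (`x % 2 == 0`), that counts how many warps contain threads disagreeing on the branch, before and after a simple partition-based remapping. The function names (`count_divergent_warps`, `redirection_map`, `layout_transform`) are invented for this illustration.

```python
WARP_SIZE = 32  # assumed warp width (threads that execute in lockstep)


def count_divergent_warps(flags, warp_size=WARP_SIZE):
    """Count warps whose threads disagree on the branch condition."""
    divergent = 0
    for start in range(0, len(flags), warp_size):
        warp = flags[start:start + warp_size]
        if any(warp) and not all(warp):
            divergent += 1
    return divergent


def redirection_map(flags):
    """Reference redirection: thread t reads data[remap[t]] instead of data[t].

    A stable partition places every element that takes the branch first,
    so at most one warp straddles the boundary between the two groups.
    """
    taken = [i for i, f in enumerate(flags) if f]
    not_taken = [i for i, f in enumerate(flags) if not f]
    return taken + not_taken


def layout_transform(data, remap):
    """Data layout transformation: physically reorder the data so that
    thread t can read new_data[t] directly, without the index array."""
    return [data[i] for i in remap]


# Hypothetical workload: 128 elements, branch taken for even values.
data = list(range(128))
flags = [x % 2 == 0 for x in data]

before = count_divergent_warps(flags)        # every warp mixes both paths
remap = redirection_map(flags)
after = count_divergent_warps([flags[i] for i in remap])
new_data = layout_transform(data, remap)

print(f"divergent warps before remapping: {before}")
print(f"divergent warps after remapping:  {after}")
```

In this toy case all four warps diverge under the original mapping and none diverge after remapping. The two helpers correspond to the two options named in the abstract: redirection keeps the data in place and adds one level of indirection, while the layout transformation pays for a copy so that threads read contiguous elements. Building the map costs time in either case, which is the overhead the abstract says must be hidden.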

Tags: Compilers, Computer science, CUDA, nVidia, Performance, Programming Languages, Tesla C1060

November 5, 2010 by hgpu
