Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation

Ping Xiang, Yi Yang, Huiyang Zhou

Dept. of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC, USA

20th International Symposium on High Performance Computer Architecture (HPCA’14), 2014

High-throughput architectures rely on high thread-level parallelism (TLP) to hide execution latencies. In state-of-the-art graphics processing units (GPUs), threads are organized in a grid of thread blocks (TBs), and each TB contains tens to hundreds of threads. With a TB-level resource management scheme, all the resources required by a TB are allocated when it is dispatched to a streaming multiprocessor (SM) and released when it finishes. In this paper, we highlight that such TB-level resource management can severely limit the TLP that can be achieved in hardware. First, different warps in a TB may finish at different times, which we refer to as "warp-level divergence". Due to TB-level resource management, the resources allocated to early-finishing warps are essentially wasted, because they are held until the longest-running warp in the same TB finishes. Second, TB-level management can lead to resource fragmentation. For example, the maximum number of threads that can run on an SM in an NVIDIA GTX 480 GPU is 1536. For an application whose TBs contain 1024 threads each, only 1 TB can run on the SM even though the SM has sufficient resources for a few hundred more threads. To overcome these inefficiencies, we propose to allocate and release resources at the warp level. Warps are dispatched to an SM as long as it has sufficient resources for a warp rather than for a whole TB. Furthermore, whenever a warp completes, its resources are released and can accommodate a new warp. This way, we effectively increase the number of active warps without increasing the size of critical resources. We present lightweight architectural support for the proposed warp-level resource management. The experimental results show that our approach achieves performance gains of up to 76.0% (16.0% on average) and energy savings of up to 21.7% (6.7% on average) at minor hardware overhead.
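The two inefficiencies named in the abstract can be illustrated with a little arithmetic. The sketch below is not the paper's model or hardware mechanism; it is a minimal illustration that assumes the per-SM thread limit is the only constraint (registers, shared memory, and TB slot limits are ignored) and a warp size of 32 threads. The function names and the warp finish times in the usage note are hypothetical.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def resident_threads_tb_level(max_threads_per_sm, threads_per_tb):
    """Threads resident on an SM when whole thread blocks are the allocation unit.

    Assumption: only the per-SM thread limit is modeled.
    """
    return (max_threads_per_sm // threads_per_tb) * threads_per_tb


def resident_threads_warp_level(max_threads_per_sm, warp_size=WARP_SIZE):
    """Threads resident on an SM when single warps are the allocation unit.

    Assumption: only the per-SM thread limit is modeled.
    """
    return (max_threads_per_sm // warp_size) * warp_size


def idle_warp_time(warp_finish_times):
    """Total time that finished warps keep holding resources under TB-level
    management, i.e. the sum of (slowest warp's finish time - own finish time)
    over all warps of one TB."""
    longest = max(warp_finish_times)
    return sum(longest - t for t in warp_finish_times)


# Fragmentation example from the abstract: GTX 480 SM limit of 1536 threads,
# thread blocks of 1024 threads.
tb_level = resident_threads_tb_level(1536, 1024)     # 1024 (one TB fits)
warp_level = resident_threads_warp_level(1536)       # 1536 (48 warps fit)
extra_warps = (warp_level - tb_level) // WARP_SIZE   # 16 additional warps
```

For warp-level divergence, a hypothetical TB whose four warps finish at times 10, 40, 25, and 40 gives `idle_warp_time([10, 40, 25, 40]) == 45`: the first and third warps hold their resources for 30 and 15 time units after they are done.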

Tags: Computer science, CUDA, GPGPU-sim, nVidia, nVidia GeForce GTX 480, Performance

January 12, 2014 by hgpu
