high performance computing on graphics processing units: hgpu.org

Thread Block Compaction for Efficient SIMT Control Flow

Wilson W. L. Fung, Tor M. Aamodt

University of British Columbia, Vancouver, BC, Canada

17th IEEE International Symposium on High-Performance Computer Architecture, HPCA-17, 2011

DOI:10.1109/HPCA.2011.5749714

@inproceedings{fung2011thread,
  title={Thread Block Compaction for Efficient SIMT Control Flow},
  author={Fung, Wilson W. L. and Aamodt, Tor M.},
  booktitle={17th IEEE International Symposium on High-Performance Computer Architecture, HPCA-17},
  year={2011}
}

Manycore accelerators such as graphics processing units (GPUs) organize processing units into single-instruction, multiple-data (SIMD) "cores" to improve throughput per unit hardware cost. Programming models for these accelerators encourage applications to run kernels with large groups of parallel scalar threads. The hardware groups these threads into warps (or wavefronts) and executes them in lockstep, an execution model NVIDIA calls single-instruction, multiple-thread (SIMT). Current GPUs use a per-warp (or per-wavefront) stack to manage divergent control flow, but this approach loses efficiency for applications with nested, data-dependent control flow. In this paper, we propose and evaluate the benefits of extending the sharing of resources in a block of warps, already used for scratchpad memory, to exploit control flow locality among threads (where such sharing may at first seem detrimental). In our proposal, warps within a thread block share a common block-wide stack for divergence handling. At a divergent branch, threads are compacted into new warps in hardware. Our simulation results show that this compaction mechanism provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism, and 17% over dynamic warp formation, on a set of CUDA applications that suffer significantly from control flow divergence.
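The intuition behind compaction can be illustrated with a small toy model. The sketch below is not the authors' simulator or hardware design: it assumes a single two-way branch, a toy warp size of 4, and an idealized compactor that may pack any threads of the block together (real hardware may impose placement constraints that this model ignores). The function names `per_warp_issue_slots` and `compacted_issue_slots` are hypothetical, introduced only for illustration. It counts how many warp-level issue slots one instruction on each side of the branch would need.

```python
from math import ceil

WARP_SIZE = 4  # toy value for readability; NVIDIA warps have 32 threads


def per_warp_issue_slots(taken, warp_size=WARP_SIZE):
    """Issue slots when every warp resolves divergence on its own.

    `taken` holds one branch outcome (True/False) per thread in the block.
    A warp whose threads disagree must run both paths one after the other,
    so it costs one slot per distinct outcome it contains.
    """
    slots = 0
    for start in range(0, len(taken), warp_size):
        warp = taken[start:start + warp_size]
        slots += len(set(warp))
    return slots


def compacted_issue_slots(taken, warp_size=WARP_SIZE):
    """Issue slots under idealized block-wide compaction (an assumption).

    All threads of the block that chose the same path are repacked into as
    few warps as possible, ignoring any lane-placement constraints.
    """
    slots = 0
    for outcome in (True, False):
        count = sum(1 for t in taken if t == outcome)
        slots += ceil(count / warp_size)
    return slots


# One thread block of 16 threads (4 warps); one thread per warp takes the branch.
block = [True, False, False, False,
         False, True, False, False,
         False, False, True, False,
         False, False, False, True]

print(per_warp_issue_slots(block), compacted_issue_slots(block))  # 8 4
```

In this example every warp diverges, so the per-warp scheme issues 8 partially filled warps, while the idealized compaction issues 4 full ones. When no warp diverges, both counts are the same, which is why the benefit depends on how much control flow divergence an application exhibits.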

Tags: Computer science, CUDA, nVidia, nVidia Quadro FX 5800, Performance, Programming techniques

September 28, 2011 by hgpu
