GPU computing architecture for irregular parallelism
The University of British Columbia, Vancouver
The University of British Columbia, 2015
@article{fung2015gpu,
title={GPU computing architecture for irregular parallelism},
author={Fung, Wilson Wai Lun},
year={2015},
publisher={University of British Columbia}
}
Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the programmer and an uncertain amount of performance/efficiency benefit. One known challenge in developing GPU applications with irregular parallelism is the underutilization of SIMD hardware in GPUs due to the application’s irregular control flow behavior, known as branch divergence. Another major development effort is to expose the available parallelism in the application as 1000s of concurrent threads without introducing data races or deadlocks. The GPU software developers may need to spend significant effort verifying the data synchronization mechanisms used in their applications. Despite various research studies indicating the potential benefits, the risks involved may discourage software developers from employing GPUs for this class of applications. This dissertation aims to reduce the burden on GPU software developers with two major enhancements to GPU architectures. First, thread block compaction (TBC) is a microarchitecture innovation that reduces the performance penalty caused by branch divergence in GPU applications. Our evaluations show that TBC provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism on a set of GPU applications that suffer significantly from branch divergence. Second, Kilo TM is a cost effective, energy efficient solution for supporting transactional memory (TM) on GPUs. With TM, programmers can uses transactions instead of fine-grained locks to create deadlock-free, maintainable, yet aggressively-parallelized code. In our evaluations, Kilo TM achieves 192X speedup over coarse-grained locking and captures 66% of the performance of fine-grained locking with 34% energy overhead.
January 26, 2015 by hgpu