Fine-Grained Synchronizations and Dataflow Programming on GPUs
Eindhoven University of Technology, Eindhoven, Netherlands
International Conference on Supercomputing (ICS), 2015
@inproceedings{li2015fine,
  title={Fine-Grained Synchronizations and Dataflow Programming on GPUs},
  author={Li, Ang and van den Braak, Gert-Jan and Corporaal, Henk and Kumar, Akash},
  booktitle={International Conference on Supercomputing (ICS)},
  year={2015}
}
The last decade has witnessed the blooming emergence of many-core platforms, especially graphics processing units (GPUs). With the exponential growth in the number of cores in GPUs, utilizing them efficiently becomes a challenge. The data-parallel programming model assumes a single instruction stream for multiple concurrent threads (SIMT); therefore little support is offered to enforce thread ordering and fine-grained synchronizations. This becomes an obstacle when migrating algorithms that exploit fine-grained parallelism, such as data-flow algorithms, to GPUs. In this paper, we propose a novel approach for fine-grained inter-thread synchronization on the shared memory of modern GPUs. We demonstrate its performance and compare it with other fine-grained and medium-grained synchronization approaches. Our method achieves 1.5x speedup over the warp-barrier based approach and 4.0x speedup over the atomic spin-lock based approach on average. To further explore the possibility of realizing fine-grained data-flow algorithms on GPUs, we apply the proposed synchronization scheme to Needleman-Wunsch, a 2D wavefront application involving massive cross-loop data dependencies. Our implementation achieves 3.56x speedup over the atomic spin-lock implementation and 1.15x speedup over the conventional data-parallel implementation for a basic sub-grid, which implies that the fine-grained, lock-based programming pattern could be an alternative choice for designing general-purpose GPU (GPGPU) applications.
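The paper's own synchronization scheme is not reproduced here; as a rough illustration of the kind of fine-grained, shared-memory synchronization the abstract contrasts with coarse barriers, the CUDA sketch below hands a value from one warp to another through a shared-memory flag instead of a block-wide __syncthreads(). The kernel name warp_handoff, the block size, and the produced value are illustrative assumptions, not taken from the paper.

#include <cstdio>
#include <cuda_runtime.h>

// Producer warp writes a value into shared memory and raises a flag;
// the consumer warp spins on the flag instead of waiting at a
// block-wide __syncthreads() barrier. (Illustrative sketch only.)
__global__ void warp_handoff(int *out)
{
    __shared__ volatile int data;    // volatile: loads/stores go straight to shared memory
    __shared__ volatile int ready;   // synchronization flag

    if (threadIdx.x == 0)
        ready = 0;
    __syncthreads();                 // initialize the flag once for the whole block

    if (threadIdx.x < 32) {          // producer warp
        if (threadIdx.x == 0) {
            data = 42;               // produce the value
            __threadfence_block();   // order the data store before the flag store
            ready = 1;               // signal the consumer warp
        }
    } else {                         // consumer warp
        while (ready == 0)           // spinning across warps is safe; spinning on a
            ;                        // lock *within* a warp can hang pre-Volta GPUs
        if (threadIdx.x == 32)
            out[0] = data;
    }
}

int main()
{
    int *d_out = nullptr, h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    warp_handoff<<<1, 64>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("consumer warp read %d\n", h_out);
    cudaFree(d_out);
    return 0;
}

The atomic spin-lock baseline mentioned in the abstract would instead guard the shared data with a lock word updated via atomicCAS/atomicExch; the reported speedups are measured against such baselines, not against this toy example.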
May 3, 2015 by hgpu