6689

High Performance and Scalable Radix Sorting: A case study of implementing dynamic parallelism for GPU computing

Duane Merrill, Andrew Grimshaw
Department of Computer Science, University of Virginia, Charlottesville, Virginia 22904, USA
Parallel Processing Letters, Volume 21, Number 2, p. 245-272, 2011

@article{merrill2011high,

   title={High Performance and Scalable Radix Sorting: A case study of implementing dynamic parallelism for GPU computing},

   author={Merrill, D. and Grimshaw, A.},

   journal={Parallel Processing Letters},

   volume={21},

   number={2},

   pages={245–272},

   year={2011}

}

Download Download (PDF)   View View   Source Source   Source codes Source codes

2289

views

The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs.
No votes yet.
Please wait...

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: