High performance comparison-based sorting algorithm on many-core GPUs
Key Laboratory of Computer System and Architecture, Institute of Computing Technology (ICT), Chinese Academy of Sciences, Beijing, China
IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010
@inproceedings{ye2010high,
title={High performance comparison-based sorting algorithm on many-core GPUs},
author={Ye, X. and Fan, D. and Lin, W. and Yuan, N. and Ienne, P.},
booktitle={Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on},
pages={1–10},
year={2010},
organization={IEEE}
}
Sorting is a kernel algorithm for a wide range of applications. In this paper, we present a new algorithm, GPU-Warpsort, to perform comparison-based parallel sort on Graphics Processing Units (GPUs). It mainly consists of a bitonic sort followed by a merge sort. Our algorithm achieves high performance by efficiently mapping the sorting tasks to GPU architectures. Firstly, we take advantage of the synchronous execution of threads in a warp to eliminate the barriers in bitonic sorting network. We also provide sufficient homogeneous parallel operations for all the threads within a warp to avoid branch divergence. Furthermore, we implement the merge sort efficiently by assigning each warp independent pairs of sequences to be merged and by exploiting totally coalesced global memory accesses to eliminate the bandwidth bottleneck. Our experimental results indicate that GPU-Warpsort works well on different kinds of input distributions, and it achieves up to 30% higher performance than previous optimized comparison-based GPU sorting algorithm on input sequences with millions of elements.
May 30, 2011 by hgpu