high performance computing on graphics processing units: hgpu.org


Sorting On A Graphics Processing Unit (GPU)

Shibdas Bandyopadhyay, Sartaj Sahni

University of Florida

Chapter in the book "Multi- and Many-Core Technologies: Architectures, Programming, Algorithms, and Applications", Chapman-Hall/CRC Press, 2013

@article{bandyopadhyay2013sorting,
  title={Sorting On A Graphics Processing Unit (GPU)},
  author={Bandyopadhyay, Shibdas and Sahni, Sartaj},
  year={2013}
}


One of the very first GPU sorting algorithms, an adaptation of bitonic sort, was developed by Govindaraju et al. [12]. Because this algorithm predates CUDA, it was implemented using GPU pixel shaders. Zachmann et al. [13] improved on this algorithm by using bitonic trees to reduce the number of comparisons made while merging bitonic sequences. Cederman et al. [7] adapted quicksort for GPUs. Their adaptation first partitions the sequence to be sorted into subsequences, sorts these subsequences in parallel, and then merges the sorted subsequences in parallel. Sintorn et al. [28] proposed a hybrid sort that splits the data using bucket sort and then merges the data using a vectorized version of merge sort. Satish et al. [26] developed an even faster merge sort. In this merge sort, two sorted sequences A and B are merged by a thread block to produce the sequence C when A and B have fewer than 256 elements each. Each thread reads an element of A and then does a binary search on B with that element to determine where it should be placed in the merged sequence C. When a sequence has more than 256 elements, A and B are divided into subsequences by a set of splitters. The splitters are chosen from the two sequences so that the interval between successive splitters is small enough to be merged by a thread block. The fastest GPU merge sort algorithm known at this time is Warpsort [31]. Warpsort first creates sorted sequences using bitonic sort, with each sorted sequence created by a thread warp. The sorted sequences are merged in pairs until only a small number of sequences remain. The remaining sequences are partitioned into subsequences that can be pairwise merged independently, and finally this pairwise merging is done with each warp merging a pair of subsequences.
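The binary-search merge attributed to Satish et al. [26] above can be illustrated with a small sequential sketch. It is not the authors' CUDA code: each loop iteration stands in for one thread, the function name `merge_by_rank` is hypothetical, and the tie-breaking rule (equal keys from A placed before those from B) is an assumption made here so that every output slot is written exactly once. The splitter step for longer sequences is not shown.

```python
from bisect import bisect_left, bisect_right


def merge_by_rank(a, b):
    """Merge sorted lists a and b by computing each element's output index.

    Illustrative only: on a GPU each iteration below would be one thread
    of a thread block; here the iterations simply run one after another.
    """
    c = [None] * (len(a) + len(b))

    for i, x in enumerate(a):
        # i elements of a precede x; the binary search counts the
        # elements of b that are strictly smaller than x.
        c[i + bisect_left(b, x)] = x

    for j, y in enumerate(b):
        # Assumed tie-break: elements of a equal to y are placed first,
        # so count the elements of a that are <= y.
        c[j + bisect_right(a, y)] = y

    return c
```

Because each element's destination depends only on its own index and one binary search in the other sequence, the iterations are independent of one another, which is what makes this style of merge attractive when one thread is assigned per element.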
Experimental results reported in [31] indicate that Warpsort is about 30% faster than the merge sort algorithm of [26]. Another comparison-based sort for GPUs, GPU sample sort, was developed by Leischner et al. [20]. Sample sort is reported to be about 30% faster, on average, than the merge sort of [26] when the keys are 32-bit integers, which would make it competitive with Warpsort for 32-bit keys. For 64-bit keys, sample sort is twice as fast, on average, as the merge sort of [26].
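Bitonic sort underlies both the earliest algorithm mentioned above [12] and the first phase of Warpsort [31]. The following is a minimal sequential sketch of a textbook bitonic sorting network, not a reconstruction of either implementation. The function name `bitonic_sort` and the power-of-two length restriction are assumptions of this sketch. Within one pass of the inner `for` loop every compare-exchange touches a disjoint pair of positions, so a GPU could assign one thread per pair.

```python
def bitonic_sort(values):
    """Return a sorted copy of values using a bitonic sorting network.

    Assumes len(values) is a power of two (or zero), as the plain
    network requires; padding for other lengths is not handled here.
    """
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")

    k = 2  # size of the bitonic sequences being merged in this stage
    while k <= n:
        j = k // 2  # distance between the two elements of each pair
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:  # handle each pair once
                    # Blocks alternate direction so that neighbouring
                    # blocks together form a bitonic sequence.
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data
```

The sequence of comparisons is fixed in advance and does not depend on the data, which is one reason this network maps well onto hardware where groups of threads execute the same instruction together.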

Tags: Algorithms, Computer science, CUDA, nVidia, Sorting, Tesla C1060

June 18, 2013 by hgpu

