high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Sorting On A Graphics Processing Unit (GPU)

Sorting On A Graphics Processing Unit (GPU)

Shibdas Bandyopadhyay, Sartaj Sahni

University of Florida

Chapter in the book "Multi- and Many-Core Technologies: Architectures, Programming, Algorithms, and Applications", Chapman-Hall/CRC Press, 2013

BibTeX

Download (PDF)

View

Source

2252

views

One of the very first GPU sorting algorithms, an adaptation of bitonic sort, was developed by Govindraju et al. [12]. Since this algorithm was developed before the advent of CUDA, the algorithm was implemented using GPU pixel shaders. Zachmann et al. [13] improved on this sort algorithm by using BitonicT rees to reduce the number of comparisons while merging the bitonic sequences. Cederman et al. [7] have adapted quick sort for GPUs. Their adaptation first partitions the sequence to be sorted into subsequences, sorts these subsequences in parallel, and then merges the sorted subsequences in parallel. A hybrid sort algorithm that splits the data using bucket sort and then merges the data using a vectorized version of merge sort is proposed by Sintron et al. [28]. Satish et al. [26] have developed an even faster merge sort. In this merge sort, two sorted sequences A and B are merged by a thread block to produce the sequence C when A and B have less than 256 elements each. Each thread reads an element of A and then does a binary search on the sequence B with that element to determine where it should be placed in the merged sequence C. When the number of elements in a sequence is more than 256, A and B are divided into a set of subsequences by using a set of splitters. The splitters are chosen from the two sequences in such a way that the interval between successive splitters is small enough to be merged by a thread block. The fastest GPU merge sort algorithm known at this time is Warpsort [31]. Warpsort first creates sorted sequences using bitonic sort; each sorted sequence being created by a thread warp. The sorted sequences are merged in pairs until only a small number of sequences remain. The remaining sequences are partitioned into subsequences that can be pairwise merged independently and finally this pairwise merging is done with each warp merging a pair of subsequences. Experimental results reported in [31] indicate that Warpsort is about 30% faster than the merge sort algorithm of [26]. Another comparison-based sort for GPUs-GPU sample sort-was developed by Leischner et al. [20]. Sample sort is reported to be about 30% faster than the merge sort of [26], on average, when the keys are 32-bit integers. This would make sample sort competitive with Warpsort for 32-bit keys. For 64-bit keys, sample sort is twice as fast, on average, as the merge sort of [26].

Tags: Algorithms, Computer science, CUDA, nVidia, Sorting, Tesla C1060

June 18, 2013 by hgpu

No votes yet.

Please wait...

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

high performance computing on graphics processing units: hgpu.org

Sorting On A Graphics Processing Unit (GPU)

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)

Sorting On A Graphics Processing Unit (GPU)

Share this:

Recent source codes

Most viewed papers (last 30 days)