Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, Pradeep Dubey

Intel Corporation, Santa Clara, CA, USA

In SIGMOD ’10: Proceedings of the 2010 international conference on Management of data (2010), pp. 351-362.

DOI:10.1145/1807167.1807207

@conference{satish2010fast,
  title={Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort},
  author={Satish, N. and Kim, C. and Chhugani, J. and Nguyen, A.D. and Lee, V.W. and Kim, D. and Dubey, P.},
  booktitle={Proceedings of the 2010 international conference on Management of data},
  pages={351--362},
  year={2010},
  organization={ACM}
}

Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures – the latest CPU and GPU architectures. We propose novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results. We perform a fair comparison of the algorithms using these best performing implementations on both architectures. While radix sort is faster on current architectures, the gap narrows from CPU to GPU architectures. Merge sort performs better than radix sort for sorting keys of large sizes – such keys will be required to accommodate the increasing cardinality of future databases. We present analytical models for analyzing the performance of our implementations in terms of architectural features such as core count, SIMD width and bandwidth. Our models successfully predict the performance results we obtain. Our analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and low bandwidth utilization. We simulate a 64-core platform with varying SIMD widths under constant bandwidth per core constraints, and show that for large data sizes of 2^40 (one trillion records), merge sort performance on large key sizes is up to 3X better than radix sort for large SIMD widths on future architectures. Therefore, merge sort should be the sorting method of choice for future databases.
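One intuition behind the key-size claim can be sketched in a few lines of scalar Python. This is only an illustration under stated assumptions, not the paper's SIMD or multi-core implementation: the function names, the 8-bit digit width, and the pass counters are all hypothetical choices made for this sketch. It shows that an LSD radix sort makes one pass over the data per digit, so its pass count grows with key width, while a two-way merge sort's pass count depends only on the number of keys.

```python
def radix_sort_lsd(keys, key_bits=32, radix_bits=8):
    """LSD radix sort for non-negative integers below 2**key_bits.

    Returns (sorted_keys, passes). Each pass reads and rewrites every key,
    and there is one pass per radix_bits-wide digit.
    """
    mask = (1 << radix_bits) - 1
    passes = 0
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)  # stable scatter by digit
        keys = [k for bucket in buckets for k in bucket]
        passes += 1
    return keys, passes


def _merge(a, b):
    """Merge two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def merge_sort_bottom_up(keys):
    """Bottom-up two-way merge sort.

    Returns (sorted_keys, passes). The pass count is ceil(log2(n)) and does
    not depend on how wide the keys are.
    """
    runs = [[k] for k in keys]
    passes = 0
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                merged.append(_merge(runs[i], runs[i + 1]))
            else:
                merged.append(runs[i])  # odd run carries over unchanged
        runs = merged
        passes += 1
    return (runs[0] if runs else []), passes
```

With 8-bit digits, doubling the key width from 32 to 64 bits doubles the radix sort pass count from 4 to 8, whereas the merge sort pass count for 1000 keys stays at 10 in both cases. The sketch ignores everything the paper's models account for (SIMD width, core count, cache blocking, and memory bandwidth), so it should be read as a counting argument rather than a performance prediction.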

Tags: Algorithms, Computer science, Databases, Sorting

November 20, 2010 by hgpu

