high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, Pradeep Dubey

Intel Corporation, Santa Clara, CA, USA

In SIGMOD ’10: Proceedings of the 2010 international conference on Management of data (2010), pp. 351-362.

DOI:10.1145/1807167.1807207

BibTeX

Download (PDF)

View

Source

5728

views

Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures – the latest CPU and GPU architectures. We propose novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results. We perform a fair comparison of the algorithms using these best performing implementations on both architectures. While radix sort is faster on current architectures, the gap narrows from CPU to GPU architectures. Merge sort performs better than radix sort for sorting keys of large sizes – such keys will be required to accommodate the increasing cardinality of future databases. We present analytical models for analyzing the performance of our implementations in terms of architectural features such as core count, SIMD and bandwidth. Our obtained performance results are successfully predicted by our models. Our analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and low bandwidth utilization. We simulate a 64-core platform with varying SIMD widths under constant bandwidth per core constraints, and show that large data sizes of 2^(40) (one trillion records), merge sort performance on large key sizes is up to 3X better than radix sort for large SIMD widths on future architectures. Therefore, merge sort should be the sorting method of choice for future databases.

Tags: Algorithms, Computer science, Databases, Sorting

November 20, 2010 by hgpu

No votes yet.

Please wait...

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

high performance computing on graphics processing units: hgpu.org

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Share this:

Recent source codes

Most viewed papers (last 30 days)