high performance computing on graphics processing units: hgpu.org

Fast in-place sorting with CUDA based on bitonic sort

Hagen Peters, Ole Schulz-Hildebrandt, and Norbert Luttenberger

Research Group for Communication Systems, Department of Computer Science, Christian-Albrechts-University Kiel, Germany

Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, 2010, Volume 6067/2010, 403-410, Proceedings of the 8th international conference on Parallel processing and applied mathematics, PPAM’09: Part I

DOI:10.1007/978-3-642-14390-8_42

State-of-the-art graphics processors provide high processing power, and the programmability offered by frameworks like CUDA increases their usability as high-performance coprocessors for general-purpose computing. Sorting is well investigated in computer science in general, but this new field of application for GPUs creates a demand for high-performance parallel sorting algorithms that fit the characteristics of modern GPU architectures. We present a high-performance in-place implementation of Batcher’s bitonic sorting networks for CUDA-enabled GPUs. We adapted bitonic sort to arbitrary input lengths and assigned compare/exchange operations to threads in a way that reduces slow global-memory accesses and thereby greatly increases the performance of the implementation.
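
As a rough illustration of the technique the abstract names, the following is a minimal sequential C++ sketch of an in-place bitonic sorting network that accepts arbitrary input lengths. It is not the authors' CUDA implementation. The function name `bitonic_sort_in_place`, the `int` element type, and the way non-power-of-two lengths are handled (skipping any compare/exchange whose partner index lies past the end of the array) are assumptions made for this example. The thread assignment that reduces global-memory access is not shown.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Compare/exchange: leave the smaller value at the lower index.
static void compare_exchange(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    if (a[lo] > a[hi]) std::swap(a[lo], a[hi]);
}

// Hypothetical helper: sorts `a` ascending, in place, with a bitonic network
// in which every comparator points the same way. Because of that, positions
// past the end behave like +infinity padding that never moves, so comparators
// reaching past the end can simply be skipped (assumption for this sketch).
void bitonic_sort_in_place(std::vector<int>& a) {
    const std::size_t n = a.size();
    // k is the block size merged in this phase: 2, 4, 8, ... up to the
    // smallest power of two that is >= n.
    for (std::size_t k = 2; (k >> 1) < n; k <<= 1) {
        // First step of the merge: pair each index with its mirror image
        // inside its block of size k.
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t p = i ^ (k - 1);
            if (p > i && p < n) compare_exchange(a, i, p);
        }
        // Remaining steps: pair indices at distance j = k/4, k/8, ..., 1.
        for (std::size_t j = k >> 2; j > 0; j >>= 1) {
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t p = i ^ j;
                if (p > i && p < n) compare_exchange(a, i, p);
            }
        }
    }
}
```

Within one step, the compare/exchange operations touch disjoint pairs of indices, so a GPU version could in principle hand them to separate threads. How many steps each thread performs before going back to global memory is the part the paper addresses and this sketch leaves out.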

Tags: Algorithms, Computer science, CUDA, nVidia, nVidia GeForce GTX 280, Sorting

March 24, 2011 by hgpu
