high performance computing on graphics processing units: hgpu.org

Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead

Hagen Peters, Ole Schulz-Hildebrandt, Norbert Luttenberger

Research Group for Communication Systems, Department of Computer Science, Christian-Albrechts-University Kiel, Germany

In 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW) (April 2010), pp. 1-8.

DOI:10.1109/IPDPSW.2010.5470833

Sorting is a well-investigated topic in computer science, and many efficient sorting algorithms for CPUs and GPUs have been developed. GPUs offer no swapping or paging to provide more virtual memory than is physically available, so sorting sequences that exceed GPU memory on the GPU becomes a problem of external sorting. In this contribution we present a novel merge-based external sorting algorithm for one or more CUDA-enabled GPUs. We reduce the performance impact of memory transfers to and from the GPU by using an approach similar to regular samplesort and by overlapping memory transfers with GPU computation. We achieve good GPU utilization and load balancing among GPUs by carefully choosing the samples and the amount of GPU memory used for computation. We demonstrate the performance of our algorithm through extensive testing. Using two GTX 280 GPUs, the implementation outperforms the fastest CPU sorting algorithms known to the authors.
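The general idea of a samplesort-style external sort can be illustrated with a small CPU-only sketch. This is not the authors' implementation: the function names (`sort_chunk`, `choose_splitters`, `partition`, `external_sort`) and the parameters `chunk_size` and `samples_per_chunk` are hypothetical, `sort_chunk` merely stands in for an in-memory sort that would run on a GPU, and the sketch models neither the overlap of transfers with computation nor load balancing across several GPUs. It only shows how regularly spaced samples from sorted chunks yield splitters, so that each bucket can be merged independently and the buckets concatenated.

```python
import bisect
import heapq


def sort_chunk(chunk):
    """Stand-in for an in-memory sort of one chunk (assumed to run on a GPU)."""
    return sorted(chunk)


def choose_splitters(sorted_chunks, samples_per_chunk, num_buckets):
    """Pick num_buckets - 1 splitters from evenly spaced samples of each sorted chunk."""
    samples = []
    for chunk in sorted_chunks:
        step = max(1, len(chunk) // samples_per_chunk)
        samples.extend(chunk[step - 1::step])
    if not samples:
        return []
    samples.sort()
    return [samples[(i * len(samples)) // num_buckets] for i in range(1, num_buckets)]


def partition(sorted_chunk, splitters):
    """Cut one sorted chunk into len(splitters) + 1 sorted pieces via binary search."""
    cuts = [0]
    cuts.extend(bisect.bisect_right(sorted_chunk, s) for s in splitters)
    cuts.append(len(sorted_chunk))
    return [sorted_chunk[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]


def external_sort(data, chunk_size, samples_per_chunk=8):
    """Sort data in chunk_size pieces, then merge bucket by bucket."""
    # Phase 1: sort each chunk that is assumed to fit into device memory.
    chunks = [sort_chunk(data[i:i + chunk_size])
              for i in range(0, len(data), chunk_size)]
    # Assumption of this sketch: use as many buckets as there are chunks.
    num_buckets = len(chunks)
    splitters = choose_splitters(chunks, samples_per_chunk, num_buckets)
    # Phase 2: split every sorted chunk at the same splitters.
    pieces = [partition(chunk, splitters) for chunk in chunks]
    # Phase 3: merge the pieces of each bucket; buckets are already in order.
    result = []
    for b in range(num_buckets):
        result.extend(heapq.merge(*(p[b] for p in pieces)))
    return result
```

For example, `external_sort([3, 1, 2], 10)` uses a single chunk and returns `[1, 2, 3]`. Because every element of bucket `b` is less than or equal to every element of bucket `b + 1`, each bucket merge is independent of the others, which is what would allow such merges to be distributed across devices.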

Tags: Algorithms, Computer science, CUDA, nVidia, nVidia GeForce GTX 260, nVidia GeForce GTX 280, Sorting

November 2, 2010 by hgpu
