high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » A Fast and Simple Approach to Merge and Merge Sort using Wide Vector Instructions

A Fast and Simple Approach to Merge and Merge Sort using Wide Vector Instructions

Alex Watkins, Oded Green

Georgia Institute of Technology

Irregular Applications: Architectures and Algorithms (IA^3), 2018

@article{watkins2018fast,

title={A Fast and Simple Approach to Merge and Merge Sort using Wide Vector Instructions},

author={Watkins, Alex and Green, Oded},

year={2018}

}

Download (PDF)

View

Source

2101

views

Merging and sorting algorithms are the backbone of many modern computer applications. As such, efficient implementations are desired. Recent architectural advancements in CPUs (Central Processing Units), such as wider and more powerful vector instructions, allow for algorithmic improvements. This paper presents a new approach to merge sort using vector instructions. Traditional approaches to vectorized sorting typically utilize a bitonic sorting network (Batcher’s Algorithm) which adds significant overhead. Our approach eliminates the overhead from this approach. We start with a branch-avoiding merge algorithm and then use the Merge Path algorithm to split up merging between the different SIMD lanes. Testing demonstrates that the algorithm not only surpasses the SIMD based bitonic counterpart, but that it is over 2.94x faster than a traditional merge, merging over 300M keys per second in one thread and over 16B keys per second in parallel. Our new sort reaches is over 5x faster than quicksort and 2x faster than Intel’s IPP library sort, sorting over 5.3M keys per second for a single processor and in parallel over 500M keys per second and a speedup of over 2x from a traditional merge sort.

Tags: Algorithms, Computer science, Intel Xeon Phi, Sorting

December 2, 2018 by hgpu

Rating: 2.0/5. From 1 vote.

Please wait...

Your response

You must be logged in to post a comment.

high performance computing on graphics processing units: hgpu.org

A Fast and Simple Approach to Merge and Merge Sort using Wide Vector Instructions

Your response

Recent source codes

Agentic Code Optimization via Compiler-LLM Cooperation

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

True 4-Bit Quantized CNN Training on CPU

cuFuzz: A GPU-oriented coverage-guided fuzzer for userland CUDA application

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Most viewed papers (last 30 days)

A Fast and Simple Approach to Merge and Merge Sort using Wide Vector Instructions

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)