high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Multireduce and Multiscan on Modern GPUs

Multireduce and Multiscan on Modern GPUs

Marco Eilers

Department of Computer Science, University of Copenhagen

University of Copenhagen

BibTeX

Download (PDF)

View

Source

1807

views

With the introduction of platforms like CUDA and OpenCL, the superior computing power of modern GPUs compared to CPUs is used more and more often to accelerate general purpose computations. Data parallel primitives like reduce, scan or sort can be used as simple, deterministic building blocks for parallel algorithms, hiding the complexity of the underlying GPU implementation. The multireduce and multiscan, generalizations of the ordinary reduce and scan, have immediate applications for common algorithmic problems and show potential to be useful primitives as well. This thesis aims to find efficient GPU algorithms for both of theses problems, which currently do not exist. In order to establish a baseline CPU implementation to which GPU algorithms can be compared, we first discuss sequential algorithms for both multireduce and multiscan. We point out performance problems of sequential algorithms for both operations which stem from poor use of various caches, and show that huge pages and partial radix sorting can be used to avoid these problems. We find that the potential improvement which can be achieved through sorting differs wildly between different CPUs, and provide a simulator to estimate the benefit of this method for any given system. For GPUs, we systematically evaluate possible algorithms for both problems. We examine three main groups of algorithms: parallel adaptations of the sequential algorithm, GPU adaptations of existing PRAM algorithms, and sort-based conversions to simpler problems, namely segmented scan and reduce. For each of these possibilities, we discuss how the GPU memory hierarchy can be used for best performance, and which additional algorithmic improvements can be implemented for operators which are commutative as well as associative. For the multireduce, we propose an algorithm which, despite its generality, performs at least as well as the best published histogramming algorithm for all inputs and provides a 18x speedup over the CPU algorithm for small numbers of buckets. We also propose an algorithm based on sorting and segmented reduction which, in contrast to the previous algorithm, can also be used for non-commutative operators, and whose performance is on par with reduce-by-key implementations of current libraries. For large numbers of buckets, the range of available algorithms is smaller and the speedup compared to the CPU implementation can shrink to a factor of three. Like for CPUs, we establish that partial sorting can lead to better cache hit rates and better overall performance under certain circumstances. Subsequently, we apply the insights gained in the design of multireduce algorithms to closely related problems, namely scattering and histogramming. In the latter case, the superior work distribution of our fastest multireduce algorithm results in a 40% speedup over the best currently available histogram algorithm, independently of the input data, making our solution the fastest existing histogram algorithm for GPUs. A similar discussion is presented for possible multiscan algorithms. We observe that the only existing work-efficient PRAM algorithm for the multiscan is intrinsically unsuited for execution on GPUs and leads to poor performance. While a moderate speedup compared to the sequential algorithm is possible, the overall performance of multiscan is significantly worse than that of multireduce algorithms. We therefore conclude that the multiscan cannot be recommended as a general building block for GPU algorithms.

Tags: Algorithms, Computer science, CUDA, nVidia, nVidia GeForce GTX 690, Sorting, Thesis

May 6, 2014 by hgpu

Rating: 2.0/5. From 1 vote.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org

Multireduce and Multiscan on Modern GPUs

Your response

Recent source codes

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

KISim: Kubernetes Intelligent Scheduling Simulator

Efficient GPU Implementation of Multi-Precision Integer Division

exa-AMD: Exascale Accelerated Materials Discovery

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Most viewed papers (last 30 days)

Multireduce and Multiscan on Modern GPUs

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)