Multireduce and Multiscan on Modern GPUs

Marco Eilers
Department of Computer Science, University of Copenhagen
University of Copenhagen


   title={Multireduce and Multiscan on Modern GPUs},

   author={Eilers, Marco},



Download Download (PDF)   View View   Source Source   



With the introduction of platforms like CUDA and OpenCL, the superior computing power of modern GPUs compared to CPUs is used more and more often to accelerate general purpose computations. Data parallel primitives like reduce, scan or sort can be used as simple, deterministic building blocks for parallel algorithms, hiding the complexity of the underlying GPU implementation. The multireduce and multiscan, generalizations of the ordinary reduce and scan, have immediate applications for common algorithmic problems and show potential to be useful primitives as well. This thesis aims to find efficient GPU algorithms for both of theses problems, which currently do not exist. In order to establish a baseline CPU implementation to which GPU algorithms can be compared, we first discuss sequential algorithms for both multireduce and multiscan. We point out performance problems of sequential algorithms for both operations which stem from poor use of various caches, and show that huge pages and partial radix sorting can be used to avoid these problems. We find that the potential improvement which can be achieved through sorting differs wildly between different CPUs, and provide a simulator to estimate the benefit of this method for any given system. For GPUs, we systematically evaluate possible algorithms for both problems. We examine three main groups of algorithms: parallel adaptations of the sequential algorithm, GPU adaptations of existing PRAM algorithms, and sort-based conversions to simpler problems, namely segmented scan and reduce. For each of these possibilities, we discuss how the GPU memory hierarchy can be used for best performance, and which additional algorithmic improvements can be implemented for operators which are commutative as well as associative. For the multireduce, we propose an algorithm which, despite its generality, performs at least as well as the best published histogramming algorithm for all inputs and provides a 18x speedup over the CPU algorithm for small numbers of buckets. We also propose an algorithm based on sorting and segmented reduction which, in contrast to the previous algorithm, can also be used for non-commutative operators, and whose performance is on par with reduce-by-key implementations of current libraries. For large numbers of buckets, the range of available algorithms is smaller and the speedup compared to the CPU implementation can shrink to a factor of three. Like for CPUs, we establish that partial sorting can lead to better cache hit rates and better overall performance under certain circumstances. Subsequently, we apply the insights gained in the design of multireduce algorithms to closely related problems, namely scattering and histogramming. In the latter case, the superior work distribution of our fastest multireduce algorithm results in a 40% speedup over the best currently available histogram algorithm, independently of the input data, making our solution the fastest existing histogram algorithm for GPUs. A similar discussion is presented for possible multiscan algorithms. We observe that the only existing work-efficient PRAM algorithm for the multiscan is intrinsically unsuited for execution on GPUs and leads to poor performance. While a moderate speedup compared to the sequential algorithm is possible, the overall performance of multiscan is significantly worse than that of multireduce algorithms. We therefore conclude that the multiscan cannot be recommended as a general building block for GPU algorithms.
Rating: 2.0. From 1 vote.
Please wait...

* * *

* * *

HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors

Contact us: