An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation

Akihiko Kasagi, Koji Nakano, and Yasuaki Ito

Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima, 739-8527 Japan

International Conference on Parallel Processing (ICPP), 2013

The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computation on CUDA-enabled GPUs. Offline permutation is the task of copying the numbers stored in an array a of size n to an array b of the same size according to a permutation P given in advance. A conventional algorithm completes the offline permutation by executing b[p[i]] <- a[i] for all i in parallel, where an array p stores the permutation P. This conventional algorithm performs just three rounds of memory access: reading from a, reading from p, and writing to b. The main contribution of this paper is an optimal offline permutation algorithm that runs in O(n/w + L) time units using n threads on the HMM with width w and latency L. We also implement this algorithm on a GeForce GTX-680 GPU and evaluate its performance. Quite surprisingly, it outperforms the conventional algorithm for most permutations, even though it performs 32 rounds of memory access. For example, the bit-reversal permutation of 4M float (32-bit) numbers completes in 780ms with our optimal algorithm, while the conventional algorithm takes 2328ms. These experimental results provide a good example of GPU computation in which a complicated but ingenious implementation with a larger constant factor in computing time outperforms a much simpler conventional algorithm.
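To make the conventional algorithm concrete, here is a minimal sequential sketch in Python of the operation b[p[i]] <- a[i], together with the bit-reversal permutation used as the example in the abstract. It is an illustration only, not the authors' CUDA code: the function names `bit_reversal_permutation` and `permute_conventional` are hypothetical, and the loop over i stands in for what the abstract describes as n threads running in parallel. The optimal algorithm itself is not described in the abstract and is not sketched here.

```python
def bit_reversal_permutation(n):
    """Return p where p[i] is i with its log2(n) bits reversed.

    Hypothetical helper for illustration; n must be a power of two.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    bits = n.bit_length() - 1
    p = [0] * n
    for i in range(n):
        reversed_index = 0
        x = i
        for _ in range(bits):
            # Shift the lowest bit of x into the low end of reversed_index.
            reversed_index = (reversed_index << 1) | (x & 1)
            x >>= 1
        p[i] = reversed_index
    return p


def permute_conventional(a, p):
    """Conventional offline permutation: b[p[i]] <- a[i] for every i.

    Each iteration touches memory three times (read a, read p, write b),
    matching the three rounds of memory access named in the abstract.
    On a GPU, each value of i would be handled by its own thread.
    """
    if len(a) != len(p):
        raise ValueError("a and p must have the same length")
    b = [None] * len(a)
    for i in range(len(a)):
        b[p[i]] = a[i]
    return b
```

Because bit reversal is its own inverse, applying `permute_conventional` twice with the same p returns the original array, which is a convenient sanity check. The scattered writes to b are what make this simple form costly on real hardware: neighboring values of i can write to widely separated addresses.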

Tags: Algorithms, Computer science, CUDA, Memory, nVidia, nVidia GeForce GTX 680

November 22, 2013 by hgpu
