An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation

Akihiko Kasagi, Koji Nakano, and Yasuaki Ito
Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima, 739-8527 Japan
International Conference on Parallel Processing (ICPP), 2013

@inproceedings{kasagi2013optimal,
   title={An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation},
   author={Kasagi, Akihiko and Nakano, Koji and Ito, Yasuaki},
   booktitle={Proc. of International Conference on Parallel Processing},
   year={2013}
}


The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computation on CUDA-enabled GPUs. Offline permutation is the task of copying numbers stored in an array a of size n to an array b of the same size according to a permutation P given in advance. A conventional algorithm completes the offline permutation by executing b[p[i]]<-a[i] for all i in parallel, where an array p stores the permutation P. This conventional algorithm performs only three rounds of memory access: reading from a, reading from p, and writing to b. The main contribution of this paper is an optimal offline permutation algorithm that runs in O(n/w+L) time units using n threads on the HMM with width w and latency L. We also implement our optimal offline permutation algorithm on a GeForce GTX-680 GPU and evaluate its performance. Quite surprisingly, our optimal offline permutation algorithm outperforms the conventional algorithm for most permutations, although it performs 32 rounds of memory access. For example, the bit-reversal permutation for 4M float (32-bit) numbers can be completed in 780ms by our optimal permutation algorithm, while the conventional algorithm takes 2328ms. The experimental results of this paper thus provide a good example of GPU computation in which a complicated but ingenious implementation with a larger constant factor in computing time outperforms a much simpler conventional algorithm.
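
As a rough illustration of the conventional baseline described in the abstract (not the paper's optimal algorithm, whose details are not given here), the following minimal Python sketch applies b[p[i]] <- a[i] using a bit-reversal permutation on a small array. The function names `bit_reversal_permutation` and `conventional_permutation` are hypothetical, and the sequential loop merely stands in for the n parallel GPU threads; it says nothing about memory-access cost on the HMM.

```python
def bit_reversal_permutation(n):
    """Return p where p[i] is i with its log2(n) bits reversed.

    Assumption for this sketch: n is a positive power of two.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    bits = n.bit_length() - 1
    p = []
    for i in range(n):
        r = 0
        for k in range(bits):
            # Move bit k of i to the mirrored position (bits - 1 - k).
            if i & (1 << k):
                r |= 1 << (bits - 1 - k)
        p.append(r)
    return p


def conventional_permutation(a, p):
    """Conventional offline permutation: b[p[i]] <- a[i] for all i.

    Each loop iteration plays the role of one thread: it reads a[i],
    reads p[i], and writes b[p[i]] (three memory accesses per element).
    """
    b = [None] * len(a)
    for i in range(len(a)):
        b[p[i]] = a[i]
    return b


# Small usage example with n = 8 (3-bit indices).
a = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
p = bit_reversal_permutation(len(a))
b = conventional_permutation(a, p)
```

With n = 8, index 1 (binary 001) maps to 4 (binary 100), so a[1] lands in b[4]. Scattered writes of this kind are the access pattern the abstract reports as slow for the conventional algorithm on the GPU.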