## Bulk Execution of Oblivious Algorithms on the Unified Memory Machine, with GPU Implementation

Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima, 739-8527 Japan

International Parallel and Distributed Processing Symposium Workshops, pp. 586-595, 2014

@article{tani2014bulk,
  title={Bulk Execution of Oblivious Algorithms on the Unified Memory Machine, with GPU Implementation},
  author={Tani, Kazuya and Takafuji, Daisuke and Nakano, Koji and Ito, Yasuaki},
  year={2014}
}

The Unified Memory Machine (UMM) is a theoretical parallel computing model that captures the essence of global memory access on GPUs. A sequential algorithm is oblivious if the address accessed at each time step does not depend on the input data. Many important tasks, including matrix computation, signal processing, sorting, dynamic programming, and encryption/decryption, can be performed by oblivious sequential algorithms. The bulk execution of a sequential algorithm is its execution for many different inputs, either in turn or at the same time. The main contribution of this paper is to show that the bulk execution of an oblivious sequential algorithm can be implemented to run very efficiently on the UMM. More specifically, the bulk execution for p different inputs can be implemented to run in O(pt/w + lt) time units using p threads on the UMM with memory width w and memory access latency l, where t is the running time of the oblivious sequential algorithm. We also prove that this implementation is time optimal. Further, we have implemented two oblivious sequential algorithms: one computes the prefix-sums of an array of size n, and the other finds the optimal triangulation of a convex n-gon using dynamic programming. The prefix-sum algorithm is a very simple example of an oblivious algorithm, while the optimal triangulation algorithm is rather complicated. Experimental results on a GeForce GTX Titan show that our implementations of the bulk execution of these two algorithms can be 150 times faster than a single CPU when there are many inputs. This implies that the bulk execution of oblivious sequential algorithms is a potent and easy way to exploit the capability of CUDA-enabled GPUs.
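To make the idea of bulk execution concrete, the following is a minimal sketch in Python that simulates it on the CPU for prefix-sums. It is not the authors' GPU implementation. The function names and the interleaved memory layout (element i of input j stored at address i * p + j) are assumptions made for illustration; the abstract does not specify how the inputs are arranged in memory. The point of the sketch is that, because the algorithm is oblivious, all p simulated threads perform the same step at the same time and touch consecutive addresses.

```python
def prefix_sums(a):
    """Oblivious sequential prefix-sums.

    The sequence of addresses touched depends only on len(a),
    never on the values stored in a.
    """
    for i in range(1, len(a)):
        a[i] += a[i - 1]
    return a


def bulk_prefix_sums(inputs):
    """Simulate the bulk execution of prefix_sums for p equal-length inputs.

    Assumed layout (hypothetical, for illustration): element i of input j
    is stored at address i * p + j in one flat array, so at step i the
    p simulated threads access p consecutive addresses.
    """
    p = len(inputs)
    n = len(inputs[0])

    # Copy the inputs into a single flat memory with the interleaved layout.
    mem = [0] * (n * p)
    for j, a in enumerate(inputs):
        for i, x in enumerate(a):
            mem[i * p + j] = x

    # Every input executes the same step i; the inner loop stands in
    # for p threads running that step simultaneously.
    for i in range(1, n):
        for j in range(p):
            mem[i * p + j] += mem[(i - 1) * p + j]

    # Gather the results back into one list per input.
    return [[mem[i * p + j] for i in range(n)] for j in range(p)]
```

Calling `bulk_prefix_sums` on a list of p equal-length lists gives the same results as calling `prefix_sums` on each list separately. On a real GPU, the inner loop over j would correspond to the thread index, and the consecutive addresses are what would allow the memory accesses to be served in groups of w.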

May 31, 2014 by hgpu