Bulk Execution of Oblivious Algorithms on the Unified Memory Machine, with GPU Implementation

Kazuya Tani, Daisuke Takafuji, Koji Nakano, Yasuaki Ito
Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima, 739-8527 Japan
International Parallel and Distributed Processing Symposium Workshops, pp. 586-595, 2014

@inproceedings{tani2014bulk,
   title={Bulk Execution of Oblivious Algorithms on the Unified Memory Machine, with GPU Implementation},
   author={Tani, Kazuya and Takafuji, Daisuke and Nakano, Koji and Ito, Yasuaki},
   booktitle={International Parallel and Distributed Processing Symposium Workshops},
   pages={586--595},
   year={2014}
}

The Unified Memory Machine (UMM) is a theoretical parallel computing model that captures the essence of the global memory access of GPUs. A sequential algorithm is oblivious if the address accessed at each time step does not depend on the input data. Many important tasks, including matrix computation, signal processing, sorting, dynamic programming, and encryption/decryption, can be performed by oblivious sequential algorithms. The bulk execution of a sequential algorithm is to execute it for many different inputs, in turn or at the same time. The main contribution of this paper is to show that the bulk execution of an oblivious sequential algorithm can be implemented to run on the UMM very efficiently. More specifically, the bulk execution for p different inputs can be implemented to run in O(pt/w + lt) time units using p threads on the UMM with memory width w and memory access latency l, where t is the running time of the oblivious sequential algorithm. We also prove that this implementation is time optimal. Further, we have implemented two oblivious sequential algorithms: one computes the prefix-sums of an array of size n, and the other finds the optimal triangulation of a convex n-gon using the dynamic programming technique. The prefix-sums algorithm is a quite simple example of an oblivious algorithm, while the optimal triangulation algorithm is rather complicated. The experimental results on a GeForce GTX Titan show that our implementations of the bulk execution of these two algorithms can be 150 times faster than sequential execution on a single CPU when there are many inputs. This fact implies that our idea of bulk execution of oblivious sequential algorithms is a potent method to elicit the capability of CUDA-enabled GPUs very easily.
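
The following CUDA sketch is not from the paper; the kernel name, memory layout, and problem sizes are illustrative assumptions. It shows the bulk-execution idea for the prefix-sums example: one thread per input, with the p inputs stored interleaved so that every step of the oblivious sequential algorithm issues a run of consecutive, coalesced memory accesses.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// A minimal sketch (not the authors' code): bulk execution of the oblivious
// prefix-sums algorithm. The p inputs, each of length n, are stored
// interleaved so that element j of input i sits at data[j * p + i]; at every
// step the p threads then access p consecutive addresses, which the GPU
// (and the UMM model) can serve as coalesced, width-w memory accesses.
__global__ void bulkPrefixSums(float *data, int n, int p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per input
    if (i >= p) return;

    float sum = 0.0f;
    // The accessed address j * p + i never depends on the data values, so the
    // sequential algorithm is oblivious and all p threads stay in lock-step.
    for (int j = 0; j < n; ++j) {
        sum += data[j * p + i];
        data[j * p + i] = sum;    // replace element j with the j-th prefix sum
    }
}

int main()
{
    const int n = 1024, p = 4096;                    // illustrative sizes
    std::vector<float> h(n * p, 1.0f);               // every element is 1

    float *d_data;
    cudaMalloc(&d_data, sizeof(float) * n * p);
    cudaMemcpy(d_data, h.data(), sizeof(float) * n * p, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (p + threads - 1) / threads;
    bulkPrefixSums<<<blocks, threads>>>(d_data, n, p);

    cudaMemcpy(h.data(), d_data, sizeof(float) * n * p, cudaMemcpyDeviceToHost);
    // For input 0, the j-th prefix sum should be j + 1; check the last one.
    printf("last prefix sum of input 0 = %f (expected %d)\n",
           h[(n - 1) * p + 0], n);

    cudaFree(d_data);
    return 0;
}

In this layout every thread runs the same data-independent address sequence, which is what the paper's O(pt/w + lt) bound relies on: each of the t steps costs roughly p/w width-limited memory transactions plus the latency l along the pipeline.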