high performance computing on graphics processing units: hgpu.org


Bulk Execution of Oblivious Algorithms on the Unified Memory Machine, with GPU Implementation

Kazuya Tani, Daisuke Takafuji, Koji Nakano, Yasuaki Ito

Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima, 739-8527 Japan

International Parallel and Distributed Processing Symposium Workshops, pp. 586-595, 2014


The Unified Memory Machine (UMM) is a theoretical parallel computing model that captures the essence of global memory access on GPUs. A sequential algorithm is oblivious if the address accessed at each time step does not depend on the input data. Many important tasks, including matrix computation, signal processing, sorting, dynamic programming, and encryption/decryption, can be performed by oblivious sequential algorithms. The bulk execution of a sequential algorithm is the execution of that algorithm for many different inputs, either in turn or at the same time. The main contribution of this paper is to show that the bulk execution of an oblivious sequential algorithm can be implemented to run very efficiently on the UMM. More specifically, the bulk execution for p different inputs can be implemented to run in O(pt/w + lt) time units using p threads on the UMM with memory width w and memory access latency l, where t is the running time of the oblivious sequential algorithm. We also prove that this implementation is time optimal. Further, we have implemented two oblivious sequential algorithms: one computes the prefix-sums of an array of size n, and the other finds the optimal triangulation of a convex n-gon using dynamic programming. The prefix-sum algorithm is a very simple example of an oblivious algorithm, while the optimal triangulation algorithm is considerably more complicated. Experimental results on a GeForce GTX Titan show that our bulk-execution implementations of these two algorithms can be 150 times faster than a single-CPU implementation when there are many inputs. This implies that bulk execution of oblivious sequential algorithms is a potent and easily applied method for exploiting the capability of CUDA-enabled GPUs.
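To make the idea concrete, the following is a minimal Python sketch of bulk execution for the prefix-sum example. It is not the authors' implementation. It assumes an interleaved layout in which element i of input j is stored at address i*p + j, and it assumes the common UMM formulation in which memory is split into address groups of w consecutive addresses and a group of w threads is served fastest when all of its addresses fall into one group. The function names are hypothetical and the inputs are assumed to be non-empty and of equal length.

```python
def bulk_prefix_sums(inputs):
    """Prefix sums of p equal-length inputs, simulated as p lockstep threads.

    The algorithm is oblivious: at step i every thread reads element i-1
    and updates element i, whatever the data values are.
    """
    p, n = len(inputs), len(inputs[0])
    # Assumed interleaved layout: element i of input j lives at i * p + j.
    mem = [0] * (n * p)
    for j, row in enumerate(inputs):
        for i, x in enumerate(row):
            mem[i * p + j] = x
    for i in range(1, n):          # one step of the sequential algorithm
        for j in range(p):         # "thread" j handles input j
            mem[i * p + j] += mem[(i - 1) * p + j]
    return [[mem[i * p + j] for i in range(n)] for j in range(p)]


def interleaved_addresses(i, p, w):
    """Addresses touched by threads 0..w-1 at element i, interleaved layout."""
    return [i * p + j for j in range(w)]


def row_major_addresses(i, n, w):
    """Addresses touched by threads 0..w-1 at element i, one input per row."""
    return [j * n + i for j in range(w)]


def address_groups_touched(addresses, w):
    """Number of distinct w-wide address groups hit by one group of threads."""
    return len({a // w for a in addresses})


print(bulk_prefix_sums([[1, 2, 3], [4, 5, 6]]))
# prints [[1, 3, 6], [4, 9, 15]]
print(address_groups_touched(interleaved_addresses(2, 8, 4), 4))  # prints 1
print(address_groups_touched(row_major_addresses(2, 4, 4), 4))    # prints 4
```

Under these assumptions, the interleaved layout lets each group of w threads touch a single address group per step, which is consistent with the pt/w term in the stated bound, whereas storing each input contiguously spreads the same request over w address groups.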

Tags: Algorithms, Computer science, CUDA, Memory model, nVidia, nVidia GeForce GTX 680, nVidia GeForce GTX Titan, Programming techniques

May 31, 2014 by hgpu
