high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Akihiko Kasagi, Koji Nakano, Yasuaki Ito

Department of Information Engineering, Hiroshima University

International Conference on Parallel Processing, pp.251-250, 2014

@article{kasagi2014parallel,

title={Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations},

author={Kasagi, Akihiko and Nakano, Koji and Ito, Yasuaki},

year={2014}

}

Download (PDF)

View

Source

2594

views

The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The summed area table (SAT) of a matrix is a data structure frequently used in the area of computer vision which can be obtained by computing the column-wise prefix-sums and then the rowwise prefix-sums. The main contribution of this paper is to introduce the asynchronous Hierarchical Memory Machine (asynchronous HMM), which supports asynchronous execution of CUDA blocks, and show a global-memory-access-optimal parallel algorithm for computing the SAT on the asynchronous HMM. A straightforward algorithm (2R2W SAT algorithm) on the asynchronous HMM, which computes the prefix-sums in every column using one thread each and then computes the prefix-sums in every row, performs 2 read operations and 2 write operations per element of a matrix. The previously published best algorithm (2R1W SAT algorithm) performs 2 read operations and 1 write operation per element. We present a more efficient algorithm (1R1W SAT algorithm) which performs 1 read operation and 1 write operation per element. Clearly, since every element in a matrix must be read at least once, and all resulting values must be written, our 1R1W SAT algorithm is optimal in terms of the global memory access. We also show a combined algorithm ((1+r) R1W SAT algorithm) of 2R1W and 1R1W SAT algorithms that may have better performance. We have implemented several algorithms including 2R2W, 2R1W, 1R1W, (1+r) R1W SAT algorithms on GeForce GTX 780 Ti. The experimental results show that our (1+r) R1W SAT algorithm runs faster than any other SAT algorithms for large input matrices. Also, it runs more than 100 times faster than the best SAT algorithm using a single CPU.

Tags: Algorithms, Computer vision, CUDA, Image processing, nVidia, nVidia GeForce GTX 780 Ti, SAT

October 14, 2014 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

high performance computing on graphics processing units: hgpu.org

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Your response

Recent source codes

tritonBLAS: A Lightweight Triton-based General Matrix Multiplication (GEMM) Library

hls4ml: Machine learning on FPGAs using HLS

ThunderKittens: Tile primitives for speedy kernels

NVIDIA Nemotron Parse 1.1

Iris: AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

HipKittens: Fast and Furious AMD Kernels

Fortran xDSL dialects

mt4g: Memory Topology 4 GPUs

Falcon: GPU-Based Floating-point Adaptive Lossless Compression

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Most viewed papers (last 30 days)

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)