Weak execution ordering – exploiting iterative methods on many-core GPUs

Jianmin Chen, Zhuo Huang, Feiqi Su, Jih-Kwon Peir, Jeff Ho, Lu Peng

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL

In 2010 IEEE International Symposium on Performance Analysis of Systems & Software, ISPASS 2010 (March 2010), pp. 154-163

DOI:10.1109/ISPASS.2010.5452028

On NVIDIA’s many-core GPUs, there is no synchronization function among parallel thread blocks. When fine-grained data communication and synchronization are required for large-scale parallel programs executed by multiple thread blocks, frequent host synchronizations are necessary, and they incur a significant overhead. In this paper, we investigate a class of applications that use a chaotic version of iterative methods [5], [22] to obtain numerical solutions for partial differential equations (PDEs). Such a fast PDE solver is parallelized on GPUs with multiple thread blocks. In this parallel implementation, although frequent data communication is needed between adjacent thread blocks, a precise order of the data communication is not necessary. Separate communication threads periodically exchange the boundary values with adjacent thread blocks through the global memory. Since a precise order of the data communication is not required, the computation and communication threads can be overlapped to alleviate the communication overhead. Performance measurements of two popular applications, Poisson image editing from computer graphics and shape from shading from computer vision, on a Tesla C1060 show that a speedup of 4-5 times is achievable for both applications in comparison with the solution using host synchronization.
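To make the idea of relaxing with stale boundary values concrete, here is a minimal sketch in plain Python. It is not the authors' CUDA implementation: it is a sequential simulation under stated assumptions, using a 1D Laplace problem (u'' = 0 with u = 0 on the left and u = 1 on the right) in place of the image applications. The function name `chaotic_relaxation`, its parameters, and the list `snapshot` that stands in for global memory are all hypothetical. Each block runs several local Jacobi sweeps against halo values read once per round, so neighbouring blocks never see each other's intermediate results, yet the iteration still settles on the same solution.

```python
def chaotic_relaxation(n=16, blocks=4, inner_sweeps=3, outer_rounds=2000):
    """Relax u'' = 0 on n interior points, with u = 0 at the left edge
    and u = 1 at the right edge, using block-local sweeps on stale halos.

    Hypothetical illustration: this runs sequentially on the CPU and only
    mimics the data flow of independent thread blocks.
    """
    assert n % blocks == 0, "n must divide evenly into blocks"
    u = [0.0] * (n + 2)   # interior points plus two fixed boundary cells
    u[-1] = 1.0
    size = n // blocks

    for _ in range(outer_rounds):
        # 'snapshot' stands in for global memory: every block reads its
        # neighbours' boundary values once, at the start of the round.
        snapshot = u[:]
        new_u = u[:]
        for b in range(blocks):
            lo = 1 + b * size          # first interior index of this block
            hi = lo + size             # one past the last interior index
            local = snapshot[lo - 1:hi + 1]   # block cells plus two halos
            for _ in range(inner_sweeps):
                # Jacobi sweep inside the block; the halo cells at
                # local[0] and local[-1] stay frozen (stale) all round.
                prev = local[:]
                for i in range(1, len(local) - 1):
                    local[i] = 0.5 * (prev[i - 1] + prev[i + 1])
            # Publish this block's results for the next round.
            new_u[lo:hi] = local[1:-1]
        u = new_u
    return u


if __name__ == "__main__":
    solution = chaotic_relaxation()
    print([round(v, 4) for v in solution])
```

For this problem the exact answer is the straight line u[i] = i / (n + 1), which the sketch approaches even though blocks only exchange boundaries once per round. On a GPU the exchange would go through global memory without a host-side barrier; that part is not modelled here.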

Tags: Computer vision, CUDA, Image processing, nVidia, Partial differential equations, PDEs, Tesla C1060

January 26, 2011 by hgpu
