high performance computing on graphics processing units: hgpu.org

Efficient fine grained shared buffer management for multiple OpenCL devices

Chang-qing Xun, Dong Chen, Qiang Lan, Chun-yuan Zhang

Computer School, National University of Defense Technology, Changsha 410073, China

Journal of Zhejiang University-SCIENCE C, 14(11), 2013

DOI:10.1631/jzus.C1300078

@article{xun2013efficient,
  title={Efficient fine grained shared buffer management for multiple OpenCL devices},
  author={XUN, Chang-qing and CHEN, Dong and LAN, Qiang and ZHANG, Chun-yuan},
  year={2013}
}


OpenCL provides full code portability across hardware platforms, which makes it a good candidate for programming heterogeneous systems, typically composed of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers must manage the diverse OpenCL-enabled devices explicitly, including distributing the workload among devices and managing data transfers between them. These tedious tasks pose a significant challenge for programmers. This paper presents Distributed Shared OpenCL Memory (DSOM), which relieves users of explicit data transfer management by supporting shared buffers across devices. DSOM allocates shared buffers in system memory and treats on-device memory as a software-managed virtual cache buffer. To support fine-grained shared buffer management, DSOM includes a kernel parser for buffer access range analysis. A basic modified-shared-invalid (MSI) cache coherence protocol keeps the cache buffers coherent. In addition, we propose a novel strategy that minimizes communication cost between devices by launching each necessary data transfer as early as possible, which enables data transfer to overlap with kernel execution. Our experimental results show that the buffer access range analysis method has good applicability and that DSOM is efficient.
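The coherence idea in the abstract can be illustrated with a toy model. The following is a minimal sketch, not DSOM's implementation: the class `SharedBuffer`, its methods, the device names, and the choice of one value per chunk are all hypothetical, and real transfers are replaced by a log of `(source, destination, chunk)` tuples. It only shows how per-chunk modified/shared/invalid states can decide which pieces of a host-resident shared buffer need to move between devices.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"  # this device holds the only up-to-date copy
    SHARED = "S"    # this device's copy matches the host copy
    INVALID = "I"   # this device's copy must not be used


class SharedBuffer:
    """Toy model (hypothetical): a host copy plus one cache copy per device,
    with a coherence state tracked separately for each chunk."""

    def __init__(self, num_chunks, devices):
        self.host = [0] * num_chunks
        self.cache = {d: [0] * num_chunks for d in devices}
        self.state = {d: [State.INVALID] * num_chunks for d in devices}
        self.transfers = []  # log of simulated copies: (src, dst, chunk)

    def _write_back(self, chunk):
        # Flush a modified device copy of this chunk to the host copy.
        for dev, states in self.state.items():
            if states[chunk] is State.MODIFIED:
                self.host[chunk] = self.cache[dev][chunk]
                self.transfers.append((dev, "host", chunk))
                states[chunk] = State.SHARED

    def read(self, dev, chunk):
        # Only an invalid chunk triggers a transfer; valid chunks hit the cache.
        if self.state[dev][chunk] is State.INVALID:
            self._write_back(chunk)
            self.cache[dev][chunk] = self.host[chunk]
            self.transfers.append(("host", dev, chunk))
            self.state[dev][chunk] = State.SHARED
        return self.cache[dev][chunk]

    def write(self, dev, chunk, value):
        # The whole chunk is overwritten, so other copies are simply invalidated.
        for other in self.state:
            if other != dev:
                self.state[other][chunk] = State.INVALID
        self.cache[dev][chunk] = value
        self.state[dev][chunk] = State.MODIFIED
```

In this sketch, a write on one device followed by a read of the same chunk on another device produces exactly two logged copies (device to host, then host to device), while chunks that no kernel touches never move. Tracking state per chunk rather than per buffer is what stands in for the fine-grained management described above.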

Tags: Computer science, Heterogeneous systems, Memory model, nVidia, OpenCL, Tesla C2050

October 19, 2013 by hgpu
