
PoCL-R: A Scalable Low Latency Distributed OpenCL Runtime

Jan Solanti, Michal Babej, Julius Ikkala, Vinod Kumar Malamal Vadakital, Pekka Jääskeläinen
Faculty of Information Technology and Communication Sciences (ITC), Tampere University, Tampere, Finland
International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS XXI), 2021

@inproceedings{solanti2021pocl,
   title={PoCL-R: A Scalable Low Latency Distributed OpenCL Runtime},
   author={Solanti, Jan and Babej, Michal and Ikkala, Julius and Malamal Vadakital, Vinod Kumar and J{\"a}{\"a}skel{\"a}inen, Pekka},
   booktitle={International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS XXI)},
   year={2021}
}

Offloading the most demanding parts of applications to an edge GPU server cluster to save power or improve the result quality is a solution that becomes increasingly realistic with new networking technologies. To make such a computing scheme feasible, an application programming layer is needed that provides both low latency and scalable utilization of remote heterogeneous computing resources. To this end, we propose a latency-optimized scalable distributed heterogeneous computing runtime implementing the standard OpenCL API. In the proposed runtime, network-induced latency is reduced by means of peer-to-peer data transfers and event synchronization as well as a streamlined control protocol implementation. Further improvements can be obtained by streaming source data directly from the producer device to the compute cluster. Compute cluster scalability is improved by distributing the command and event processing responsibilities to remote compute servers. We also show how a simple optional dynamic content size buffer OpenCL extension can significantly speed up applications that utilize variable-length data. For evaluation we present a smartphone-based augmented reality rendering case study which, using the runtime, achieves a 19x improvement in frames per second and a 17x improvement in energy per frame when offloading parts of the rendering workload to a nearby GPU server. The remote kernel execution latency overhead of the runtime is only 60 microseconds on top of the network round-trip time. The scalability on multi-server multi-GPU clusters is shown with a distributed large matrix multiplication application.
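Because the runtime exposes the standard OpenCL API, ordinary host code can target a remote GPU server without modification. The sketch below is a minimal example of such standard OpenCL host code (the kernel, buffer size, and the assumption of a single default platform and device are illustrative, not taken from the paper); with a distributed runtime like PoCL-R, the same enqueue and read-back calls would transparently drive a remote device.

```c
/* Minimal standard OpenCL host-code sketch (illustrative only). */
#include <stdio.h>
#include <CL/cl.h>

static const char *kernel_src =
    "__kernel void scale(__global float *data, float factor) {\n"
    "    size_t i = get_global_id(0);\n"
    "    data[i] *= factor;\n"
    "}\n";

int main(void)
{
    enum { N = 1024 };
    float host_data[N];
    for (int i = 0; i < N; ++i)
        host_data[i] = (float)i;

    /* Assume a single platform with at least one device; a remote runtime
     * would expose its servers' devices through these same queries. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_int err;
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(ctx, device, NULL, &err);

    /* Build the kernel and create a device buffer initialized from host memory. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", &err);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(host_data), host_data, &err);

    float factor = 2.0f;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &factor);

    /* Enqueue the kernel and read the result back; with a remote device the
     * runtime performs the necessary network transfers behind these calls. */
    size_t global_size = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(host_data), host_data, 0, NULL, NULL);

    printf("host_data[1] = %f\n", host_data[1]); /* expect 2.0 */

    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}
```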
