high performance computing on graphics processing units: hgpu.org

Data transfer optimizations for heterogeneous managed runtime systems

Florin-Gabriel Blanaru

The University of Manchester

The University of Manchester, 2022

Nowadays, most programmable systems contain multiple hardware accelerators with different characteristics. To use the available hardware resources and improve the performance of their applications, developers must use a low-level language, such as C/C++. Achieving the same goal from a high-level managed language (Java, Haskell, C#) poses several challenges, such as the inability to perform asynchronous data transfers and to declare pinned memory. As a result, hardware acceleration is not yet well established in managed languages. Recently, frameworks that run on top of managed runtime systems have been developed, enabling acceleration of high-level programming languages on heterogeneous hardware. This project analyzes one particular aspect of hardware acceleration in the context of managed runtimes, namely memory transfers between the host and the device, and proposes two solutions for improving them. The first solution enhances TornadoVM, a heterogeneous managed runtime system, to allow the allocation of pinned off-heap buffers and batch processing that overlaps computation with data transfers. Using pinned memory increases data transfer performance by up to 50%. Additionally, combining pinned memory with parallel batching achieves an end-to-end speedup of up to 2.5x over sequential batches. The second solution extends MaxineVM to allocate its heap through CUDA Unified Memory, allowing Java objects resident in the heap to be accessed by the GPU. Compared against sequential Java execution, this yields an end-to-end performance increase of up to 134x and a garbage collection slowdown of 2.45x.
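The idea of batch processing that overlaps computation with data transfers can be illustrated with a minimal, CPU-only Java sketch. It is not TornadoVM's API or the thesis implementation: the class `BatchedTransferSketch` and its methods `stage`, `compute` and `sumOfSquares` are hypothetical names chosen for illustration, the "transfer" is simulated by copying into a buffer, and the "kernel" is a plain loop. It also assumes that a direct `ByteBuffer` is an adequate stand-in for a pinned buffer; direct buffers live off-heap, but the JVM does not guarantee that they are page-locked, which in practice requires a driver call such as `cudaHostAlloc`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not part of TornadoVM or MaxineVM.
final class BatchedTransferSketch {

    // Allocates an off-heap buffer (off-heap, but not guaranteed to be pinned).
    static FloatBuffer offHeap(int floats) {
        return ByteBuffer.allocateDirect(floats * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    // Stand-in for a host-to-device copy: stages one batch in an off-heap buffer.
    static void stage(float[] src, int offset, int length, FloatBuffer staging) {
        staging.clear();
        staging.put(src, offset, length);
        staging.flip();
    }

    // Stand-in for a device kernel: sums the squares of the staged batch.
    static double compute(FloatBuffer staged) {
        double sum = 0.0;
        while (staged.hasRemaining()) {
            float v = staged.get();
            sum += (double) v * v;
        }
        return sum;
    }

    // Double buffering: while batch b is being computed, batch b + 1 is
    // staged into the other buffer by a separate copier thread.
    static double sumOfSquares(float[] data, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        if (data.length == 0) return 0.0;
        FloatBuffer[] buffers = { offHeap(batchSize), offHeap(batchSize) };
        ExecutorService copier = Executors.newSingleThreadExecutor();
        try {
            int batches = (data.length + batchSize - 1) / batchSize;
            double total = 0.0;
            // The first batch has nothing to overlap with, so stage it up front.
            stage(data, 0, Math.min(batchSize, data.length), buffers[0]);
            for (int b = 0; b < batches; b++) {
                Future<?> nextCopy = null;
                if (b + 1 < batches) {
                    int off = (b + 1) * batchSize;
                    int len = Math.min(batchSize, data.length - off);
                    FloatBuffer target = buffers[(b + 1) % 2];
                    nextCopy = copier.submit(() -> stage(data, off, len, target));
                }
                total += compute(buffers[b % 2]);
                // Wait for the overlapped copy before its buffer is consumed.
                if (nextCopy != null) nextCopy.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("staging a batch failed", e);
        } finally {
            copier.shutdown();
        }
    }
}
```

For example, `BatchedTransferSketch.sumOfSquares(new float[]{1f, 2f, 3f, 4f, 5f}, 2)` processes three batches and returns `55.0`. Two buffers are enough here because a buffer is only refilled after the batch it held has been fully consumed.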

Tags: Computer science, CUDA, Heterogeneous systems, Java, nVidia, nVidia GeForce GTX 1650, nVidia Quadro GP100, OpenCL

March 27, 2022 by hgpu
