high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » A comprehensive study of Dynamic Memory Management in OpenCL kernels

A comprehensive study of Dynamic Memory Management in OpenCL kernels

Roy Spliet

Faculty of Electrical Engineering, Mathematics and Computer Science, Department of Software and Computer Technology, Parallel and Distributed Systems Group, Delft University of Technology

Delft University of Technology, 2013

@article{spliet2013comprehensive,

title={A comprehensive study of Dynamic Memory Management in OpenCL kernels},

author={Spliet, Roy},

year={2013}

}

Download (PDF)

View

Source

2793

views

Traditional (sequential) applications use malloc for a variety of dynamic data structures, like linked lists or trees. GPGPU is gaining attention and popularity because its massively-parallel architecture allows for great speed improvement for programs that can be parallelised and implemented for a platform like OpenCL. Programmers who try to port their existing sequential or even parallel program to OpenCL however will soon discover that this standard defines a subset of C with several limitations, one of which is the absence of a malloc() routine that can be called from an OpenCL kernel. This document describes the results of research towards the impact of this limitation by trying to answer the question: "How should a kernel-side heap allocator be implemented in OpenCL?". To give an answer to this question, two important aspects are investigated. Firstly we seek the impact from the programmers perspective by trying to understand the use-cases for which a parallel program requires an in-kernel heap allocator. We sought for a wide variety of implementations of parallel algorithms and investigated their memory usage patterns. Secondly we aim to explain the complexity of a dynamic memory allocator by taking a closer look at the limitations imposed by hardware and the OpenCL standard, and trying to find a way to implement a memory allocator in OpenCL. Based on our findings we present KMA, the first dynamic memory allocator for OpenCL kernels. Results from our use-case survey show that there are several cases where a programmer could benefit from an in-kernel malloc routine. Non-regular data-structures are required both for scheduling purposes and for holding intermediate results, and a memory allocator aids in creating these structures. Programmers usually find their way around the lack of a malloc routine, sometimes by creating something on their own that resembles part of one. Although a malloc routine thus is not a hard requirement, it can make the life of programmers a lot easier in these cases. Designing and implementing a memory allocator for OpenCL is far from trivial. Its capabilities are limited by the available hardware and language constraints: hardware does not allow for communication between the OpenCL kernel and the host system, and the OpenCL specification lacks fundamental features like mutex locks. Because of these constraints, even the data-structures we might take for granted, like full-featured doubly-linked-lists, cannot be implemented in OpenCL. We believe a memory allocator should be implemented in two layers. The bottom layer should implement behaviour like the malloc() routine found in C. On top of that data types can be implemented which may use datatype-specific restrictions to optimise the usage of the low-level memory allocator. Our KMA prototype shows that by carefully analysing the limitations of the OpenCL environment it is possible to implement an in-kernel memory allocator, be it limited by the constraints imposed by the GPU platform and OpenCL. Although using KMA comes with a performance penalty, it can offer the flexibility and code readability for the use-cases we identified, and offers the right level of support for optimisations if the use-case permits.

Tags: Algorithms, ATI, Computer science, nVidia, nVidia GeForce GT 640, nVidia GeForce GTX 480, nVidia GeForce GTX 680, OpenCL, Performance, Tesla C2050, Tesla K20, Thesis

June 6, 2013 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org

A comprehensive study of Dynamic Memory Management in OpenCL kernels

Your response

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)

A comprehensive study of Dynamic Memory Management in OpenCL kernels

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)