high performance computing on graphics processing units: hgpu.org

Speeding Up Reinforcement Learning with Graphics Processing Units

Jorn H. Postma

Department of BioMechanical Engineering, Faculty of Mechanical, Maritime and Materials Engineering (3mE), Delft University of Technology

Delft University of Technology, 2015

@phdthesis{postma2015speeding,
  title={Speeding Up Reinforcement Learning with Graphics Processing Units},
  author={Postma, Jorn H},
  year={2015},
  school={TU Delft, Delft University of Technology}
}

Conventionally programmed systems (e.g. robots) cannot adapt to unforeseen changes in their task or environment. Reinforcement learning (RL), a machine learning approach, could provide this flexibility, and many fields could benefit from it in terms of cost, time or some other measure. In RL, a learning agent tries to maximize the reward it obtains while interacting with a (possibly only partially) observable environment. When the environment or even the task changes, the agent notices this and adapts its behavior to keep its reward maximized. However, in most practical cases, with large or even continuous state and action spaces, converging to a decent behavioral policy takes too long to be of real use. Parallelizing RL algorithms might solve this problem: whereas a modern multi-core central processing unit (CPU) has only a handful of cores, a graphics processing unit (GPU) has hundreds. The goal of this report is to show that fitted Q iteration (FQI), a tree-based RL method, can achieve significant speedups when parallelized on a GPU.

The GPU was invented to speed up the generation of images, a process that requires raw computational power rather than the flexibility granted by the large memory caches found on a CPU. A large part of the CPU’s caches was therefore replaced by computing cores. As a consequence, memory communication on a GPU is relatively slow and can greatly limit program performance. Speedups over (multi-core) CPU applications can only be achieved if the application applies repetitive computations to many independent data elements, with far more computational instructions than memory transfers. To reduce memory latency, the data can simply be distributed over the on-chip memory (distribution), in tiles if necessary (tiling), or it can be streamed through it in a predetermined way (streaming). Furthermore, multiple cores should be able to use the data of one global memory transaction.

Sequential and parallel implementations of FQI’s KD-Trees and Extra-Trees tree-building methods were made using OpenCL and tested on the Puddle World task with an NVIDIA Tesla C2075 GPU. KD-Trees has excellent parallelization potential and adequate learning performance, whereas Extra-Trees has excellent learning performance but is more difficult to parallelize. Correspondingly, KD-Trees achieved speedups exceeding 100 times, while Extra-Trees achieved speedups of around 20 times. KD-Trees could furthermore solve much larger problems, achieved greater speedups on small problems and was less memory intensive. Despite this, and although learning times with KD-Trees were hundreds of times shorter than those of Extra-Trees, KD-Trees needed many more samples to find optimal solutions. The choice between KD-Trees and Extra-Trees thus comes down to the nature of the problem: if few samples are available, Extra-Trees is the better choice, but with more samples and a more time-critical task, KD-Trees would be preferred.

Future research could further optimize the parallel implementations (e.g. by combining multiple parallelization strategies), study applications of the implementations in the real world, and investigate the parallelization potential of other RL algorithms.
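To make the structure of fitted Q iteration concrete, the following is a minimal sequential sketch in Python. It is not the thesis's OpenCL implementation. It assumes a toy one-dimensional task rather than Puddle World, and it uses a fixed-grid averaging regressor as a simple stand-in for the KD-Trees and Extra-Trees regressors. All function and variable names (`fit_grid_regressor`, `fitted_q_iteration`, `make_samples`) are hypothetical.

```python
import random


def fit_grid_regressor(inputs, targets, n_bins):
    """Average the targets per (state bin, action).

    Assumption: this fixed grid is only a stand-in for a regression tree.
    """
    sums, counts = {}, {}
    for (s, a), y in zip(inputs, targets):
        key = (min(int(s * n_bins), n_bins - 1), a)
        sums[key] = sums.get(key, 0.0) + y
        counts[key] = counts.get(key, 0) + 1
    means = {k: sums[k] / counts[k] for k in sums}

    def predict(s, a):
        key = (min(int(s * n_bins), n_bins - 1), a)
        return means.get(key, 0.0)  # unseen cells default to 0

    return predict


def fitted_q_iteration(samples, actions, gamma=0.9, n_iterations=30, n_bins=10):
    """Batch FQI over samples of the form (s, a, r, s_next, done)."""
    q = lambda s, a: 0.0  # initial Q-function is zero everywhere
    inputs = [(s, a) for s, a, _, _, _ in samples]
    for _ in range(n_iterations):
        # Each target depends only on its own sample, so this step is
        # independent across samples.
        targets = [
            r if done else r + gamma * max(q(s2, a2) for a2 in actions)
            for _, _, r, s2, done in samples
        ]
        # Refit the regressor on the new targets.
        q = fit_grid_regressor(inputs, targets, n_bins)
    return q


def make_samples(n, seed=0):
    """Toy task (assumption): state in [0, 1], steps of 0.1, reward 1 at s >= 1."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        s = rng.random()
        a = rng.choice([-1, 1])
        s2 = min(max(s + 0.1 * a, 0.0), 1.0)
        done = s2 >= 1.0
        samples.append((s, a, 1.0 if done else 0.0, s2, done))
    return samples


ACTIONS = [-1, 1]
q = fitted_q_iteration(make_samples(2000), ACTIONS)
greedy = max(ACTIONS, key=lambda a: q(0.5, a))  # greedy action at s = 0.5
```

In this sketch the target computation treats every sample independently, which is the kind of repetitive work over many independent data elements that the abstract identifies as a precondition for GPU speedups. The regressor fit is the part the thesis replaces with tree building.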

Tags: Algorithms, Computer science, KD-tree, Machine learning, nVidia, OpenCL, Tesla C2075, Thesis

February 18, 2016 by hgpu
