high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Speeding Up Reinforcement Learning with Graphics Processing Units

Speeding Up Reinforcement Learning with Graphics Processing Units

Jorn H. Postma

Department of BioMechanical Engineering, Faculty of Mechanical, Maritime and Materials Engineering (3mE), Delft University of Technology

Delft University of Technology, 2015

BibTeX

Download (PDF)

View

Source

3249

views

Conventionally programmed systems (e.g. robots) are not able to adapt to unforeseen changes in their task or environment. Reinforcement learning (RL), a machine learning approach, could grant this flexibility. Many fields of work could greatly benefit from this, be it in terms of cost, time or some other parameter. With RL, a learning agent tries to maximize its obtained reward during its interaction with a (maybe partially) observable environment. When the environment or even the task changes, the agent notices this and will change its behavior in order to keep its reward maximized. However, in most practical cases with large, if not continuous state and action spaces, converging towards a decent behavioral policy takes too much time to be of real use. Parallelizing RL algorithms might solve this problem. Whereas a modern multi-core central processing unit (CPU) has only a handful of cores, a graphic processing unit (GPU) has hundreds. The goal of this report is to show that fitted Q iteration (FQI), a tree-based RL method, can achieve significant speedups by parallelizing it on a GPU. The GPU was invented to speed up the generation of images, as this is a process requiring raw computational power rather than flexibility granted by the large memory caches as found on a CPU. A large part of the CPU’s caches was therefore replaced by computing cores. As a consequence, memory communications on a GPU are relatively slow and can greatly limit program performance. Speedups with respect to (multi-core) CPU applications can only be achieved if the application applies repetitive computations to many independent data elements. There should be far more computational instructions than memory transfers. To reduce memory latency, the data can simply be distributed over the on-chip memory (distribution), in tiles if necessary (tiling), or it can be streamed through it in a predetermined way (streaming). Furthermore, multiple cores should be able to use the data of one global memory transaction. Sequential and parallel implementations of FQI’s KD-Trees and Extra-Trees treebuilding methods were made using OpenCL and tested using the Puddle World task on an NVIDIA C2075 GPU. KD-Trees has excellent parallelization potential and adequate learning performance, whereas Extra-Trees has excellent learning performance but is more difficult to parallelize. Correspondingly, KD-Trees achieved speedups exceeding 100 times, while Extra-Trees achieved speedups of around 20 times. KDTrees could furthermore solve much larger problems, achieved greater speedups at small problems and was less memory intensive. Despite this and the fact that learning times with KD-Trees were hundreds of times smaller than those of Extra-Trees, KD-Trees needed many more samples to find optimal solutions. The choice between KD-Trees and Extra-Trees thus comes down to the nature of the problem: if few samples are available Extra-Trees is the better choice, but with more samples and a more time-critical task KD-Trees would be preferred. Future research could further optimize the parallel implementations (e.g. by combining multiple parallelization strategies). Applications of the implementations in the real world could be researched. Also, the parallelization potential of other RL algorithms could be investigated.

Tags: Algorithms, Computer science, KD-tree, Machine learning, nVidia, OpenCL, Tesla C2075, Thesis

February 18, 2016 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org

Speeding Up Reinforcement Learning with Graphics Processing Units

Your response

Recent source codes

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

KISim: Kubernetes Intelligent Scheduling Simulator

Efficient GPU Implementation of Multi-Precision Integer Division

exa-AMD: Exascale Accelerated Materials Discovery

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Most viewed papers (last 30 days)

Speeding Up Reinforcement Learning with Graphics Processing Units

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)