Speeding Up Reinforcement Learning with Graphics Processing Units

Jorn H. Postma
Department of BioMechanical Engineering, Faculty of Mechanical, Maritime and Materials Engineering (3mE), Delft University of Technology
Delft University of Technology, 2015


@mastersthesis{postma2015speeding,
   title={Speeding Up Reinforcement Learning with Graphics Processing Units},
   author={Postma, Jorn H},
   school={TU Delft, Delft University of Technology},
   year={2015}
}





Conventionally programmed systems (e.g. robots) cannot adapt to unforeseen changes in their task or environment. Reinforcement learning (RL), a machine learning approach, could grant this flexibility, and many fields of work could benefit greatly from it, be it in terms of cost, time or some other measure. In RL, a learning agent tries to maximize the reward it obtains while interacting with a (possibly partially) observable environment. When the environment or even the task changes, the agent notices this and adjusts its behavior to keep its reward maximized. In most practical cases, however, with large if not continuous state and action spaces, converging to a decent behavioral policy takes too much time to be of real use. Parallelizing RL algorithms might solve this problem: whereas a modern multi-core central processing unit (CPU) has only a handful of cores, a graphics processing unit (GPU) has hundreds. The goal of this report is to show that fitted Q iteration (FQI), a tree-based RL method, can achieve significant speedups when parallelized on a GPU.

The GPU was invented to speed up the generation of images, a process that requires raw computational power rather than the flexibility granted by the large memory caches found on a CPU. A large part of the CPU's caches was therefore replaced by computing cores. As a consequence, memory transactions on a GPU are relatively slow and can greatly limit program performance. Speedups with respect to (multi-core) CPU applications can only be achieved if the application applies repetitive computations to many independent data elements: there should be far more computational instructions than memory transfers. To reduce memory latency, the data can be distributed over the on-chip memory (distribution), in tiles if necessary (tiling), or streamed through it in a predetermined way (streaming).
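To make the tiling idea concrete, here is a minimal CPU-side sketch in Python (not the thesis's OpenCL code): a blocked matrix multiply that processes the operands one small tile at a time, mirroring how a GPU kernel would stage tiles of data in fast on-chip memory before computing on them.

```python
def tiled_matmul(a, b, n, tile=2):
    """Multiply two n x n matrices (lists of lists) tile by tile.

    Working on one `tile` x `tile` block at a time keeps the active
    data small, which is the same locality principle a GPU exploits
    by loading tiles into on-chip (shared) memory.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # tile row of the output
        for jj in range(0, n, tile):      # tile column of the output
            for kk in range(0, n, tile):  # tiles along the shared dimension
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        s = 0.0
                        for k in range(kk, min(kk + tile, n)):
                            s += a[i][k] * b[k][j]
                        c[i][j] += s
    return c
```

In an actual GPU kernel, each work-group would cooperatively load one tile of `a` and one tile of `b` into on-chip memory so that a single global memory transaction serves many cores; the pure-Python version only illustrates the access pattern.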
Furthermore, multiple cores should be able to use the data of a single global memory transaction.

Sequential and parallel implementations of FQI's KD-Trees and Extra-Trees tree-building methods were made using OpenCL and tested on the Puddle World task on an NVIDIA C2075 GPU. KD-Trees has excellent parallelization potential and adequate learning performance, whereas Extra-Trees has excellent learning performance but is more difficult to parallelize. Correspondingly, KD-Trees achieved speedups exceeding 100 times, while Extra-Trees achieved speedups of around 20 times. KD-Trees could furthermore solve much larger problems, achieved greater speedups on small problems and was less memory-intensive. Despite this, and although learning times with KD-Trees were hundreds of times shorter than those of Extra-Trees, KD-Trees needed many more samples to find optimal solutions. The choice between KD-Trees and Extra-Trees thus comes down to the nature of the problem: if few samples are available, Extra-Trees is the better choice, but with more samples and a more time-critical task, KD-Trees would be preferred. Future research could further optimize the parallel implementations (e.g. by combining multiple parallelization strategies), investigate real-world applications of the implementations, and explore the parallelization potential of other RL algorithms.
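As an illustration of why KD-tree building parallelizes so well, here is a minimal sequential sketch in Python (the thesis's implementation is in OpenCL and differs in detail): every median split produces two fully independent subtrees, so after a few levels there are many subtree builds that a GPU could run concurrently.

```python
def build_kdtree(points, depth=0):
    """Build a KD-tree over a list of equal-length coordinate tuples.

    Each node splits the remaining points at the median along a
    cycling dimension. The left and right subtrees share no data,
    which is what makes the build easy to distribute over many cores.
    """
    if len(points) <= 1:
        return points[0] if points else None  # leaf: a single point
    axis = depth % len(points[0])             # cycle through dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                       # median split
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }
```

Extra-Trees, by contrast, grows an ensemble of trees whose split points are drawn at random and scored against the samples, so each split involves more data-dependent work, which is one reason it is harder to parallelize efficiently.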
