Managing, Profiling, and Optimizing Heterogeneous GPU Workloads

James Gleeson
Department of Computer Science, University of Toronto





The popularity of machine learning (ML) workloads has made GPU instance offerings ubiquitous in the cloud, introducing new challenges in managing, profiling, and optimizing GPU workloads. Cloud providers assign passthrough GPUs directly to virtual machines (VMs) for high performance, but doing so renders VM migration non-functional, limiting cloud operators' ability to manage hardware resources. Existing general purpose GPU (GPGPU) and deep neural network (DNN) profiling tools are ineffective for heterogeneous CPU/GPU workloads like reinforcement learning (RL) since they only provide information about GPU kernel and DNN layer execution, and ignore CPU-side bottlenecks such as simulation. The lack of adequate profiling tools has led ML researchers to rely on naive, costly cluster scale-up solutions to optimize RL training time, which can cost millions of dollars and are inaccessible to most ML researchers. In this dissertation, we build systems software for addressing these challenges. For management, we build Crane, a GPU virtualization middleware that achieves within 5% of passthrough GPU performance, requires no OS/application/hypervisor modifications, and can even enable migration between heterogeneous GPU targets. For profiling, we build RL-Scope, a cross-stack profiling tool tailored to RL workloads that breaks down low-level CPU/GPU training time scoped to high-level algorithmic operations (i.e., inference, simulation, backpropagation). We survey RL workloads across major workload dimensions (i.e., simulator, RL algorithm, DNN framework) and demonstrate that RL workloads suffer universally from low GPU utilization, and that naive attempts to increase GPU utilization by parallelizing GPU inference requests are unsuccessful. For optimization, we propose two optimizations targeting the time-consuming data collection phase of RL training. First, GPU vectorization moves simulation from the CPU to the GPU to benefit from increased hardware parallelism, achieving a 1024× speedup over CPU implementations. Second, simulator kernel fusion fuses multiple steps of simulation into a single GPU kernel launch to benefit from caching simulator state in fast GPU registers, obtaining an 11.3× speedup over an unfused GPU kernel. Both optimizations are orthogonal and can be combined for massive multiplicative speedups, and are more accessible to ML researchers since they do not rely on costly cluster scale-up.
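
The idea behind simulator kernel fusion can be illustrated with a small CPU-only sketch. This is not the dissertation's implementation: the toy point-mass simulator, the names `make_state`, `step`, `rollout_unfused`, and `rollout_fused`, and the assumption that actions for all timesteps are available up front are all hypothetical, chosen only to show the control-flow difference. The `state` dictionary stands in for GPU global memory, local variables stand in for registers, and the returned count stands in for the number of kernel launches.

```python
# Conceptual sketch only: plain Python standing in for GPU kernels.
# All names and the toy dynamics are hypothetical, not from the dissertation.
DT = 0.01  # assumed timestep for the toy simulator


def make_state(num_envs):
    """Shared state buffer for num_envs environments (models global memory)."""
    return {"pos": [0.0] * num_envs, "vel": [0.0] * num_envs}


def step(pos, vel, action):
    """Advance one environment by one timestep (toy point-mass dynamics)."""
    vel = vel + action * DT
    pos = pos + vel * DT
    return pos, vel


def rollout_unfused(state, actions):
    """One modeled launch per timestep.

    Every step reads state from the shared buffer and writes it back,
    which models an unfused kernel touching global memory each step.
    """
    launches = 0
    for step_actions in actions:
        # On a GPU, each environment would map to its own thread.
        for env in range(len(state["pos"])):
            p, v = step(state["pos"][env], state["vel"][env], step_actions[env])
            state["pos"][env], state["vel"][env] = p, v
        launches += 1
    return launches


def rollout_fused(state, actions):
    """One modeled launch covering all timesteps.

    Each environment loads its state once, keeps it in local variables
    (modeling registers) across all steps, and writes back once at the end.
    """
    for env in range(len(state["pos"])):
        p, v = state["pos"][env], state["vel"][env]
        for step_actions in actions:
            p, v = step(p, v, step_actions[env])
        state["pos"][env], state["vel"][env] = p, v
    return 1


if __name__ == "__main__":
    actions = [[1.0] * 4 for _ in range(8)]  # 8 timesteps, 4 environments
    unfused, fused = make_state(4), make_state(4)
    print("unfused launches:", rollout_unfused(unfused, actions))
    print("fused launches:", rollout_fused(fused, actions))
    print("same final state:", unfused == fused)
```

Both rollouts apply the same operations in the same order to each environment, so they produce the same final state; the fused version differs only in making one modeled launch and touching the shared buffer once per environment. The sketch does not measure performance and says nothing about the speedups reported above.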
