Managing, Profiling, and Optimizing Heterogeneous GPU Workloads

James Gleeson
Department of Computer Science, University of Toronto

@phdthesis{gleeson2023managing,
   title={Managing, Profiling, and Optimizing Heterogeneous GPU Workloads},
   author={Gleeson, James},
   school={University of Toronto},
   year={2023}
}

The popularity of machine learning (ML) workloads has made GPU instance offerings ubiquitous in the cloud, introducing new challenges in managing, profiling, and optimizing GPU workloads. Cloud providers assign passthrough GPUs directly to virtual machines (VMs) for high performance, but doing so renders VM migration non-functional, limiting cloud operators' ability to manage hardware resources. Existing general-purpose GPU (GPGPU) and deep neural network (DNN) profiling tools are ineffective for heterogeneous CPU/GPU workloads like reinforcement learning (RL), since they only provide information about GPU kernel and DNN layer execution and ignore CPU-side bottlenecks such as simulation. The lack of adequate profiling tools has led ML researchers to rely on naive, costly cluster scale-up solutions to optimize RL training time, which can cost millions of dollars and are inaccessible to most ML researchers.

In this dissertation, we build systems software to address these challenges. For management, we build Crane, a GPU virtualization middleware that achieves within 5% of passthrough GPU performance, requires no OS/application/hypervisor modifications, and can even enable migration between heterogeneous GPU targets. For profiling, we build RL-Scope, a cross-stack profiling tool tailored to RL workloads that breaks down low-level CPU/GPU training time scoped to high-level algorithmic operations (i.e., inference, simulation, backpropagation). We survey RL workloads across major workload dimensions (i.e., simulator, RL algorithm, DNN framework) and demonstrate that RL workloads universally suffer from low GPU utilization, and that naïve attempts to increase GPU utilization by parallelizing GPU inference requests are unsuccessful.

For optimization, we propose two optimizations targeting the time-consuming data-collection phase of RL training. First, GPU vectorization moves simulation from the CPU to the GPU to benefit from increased hardware parallelism, achieving a 1024× speedup over CPU implementations. Second, simulator kernel fusion fuses multiple steps of simulation into a single GPU kernel launch to benefit from caching simulator state in fast GPU registers, obtaining an 11.3× speedup over an unfused GPU kernel. The two optimizations are orthogonal and can be combined for multiplicative speedups, and they are more accessible to ML researchers since they do not rely on costly cluster scale-up.
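To make the cross-stack profiling idea concrete, below is a minimal sketch of operation-scoped timing in the spirit of RL-Scope: high-level algorithmic operations are wrapped in annotations so that training time can be attributed to them. The `scope` helper and the operation names are illustrative assumptions, not RL-Scope's actual API.

```python
# Hypothetical sketch: attribute training-loop time to high-level
# algorithmic operations (inference, simulation, backpropagation),
# in the spirit of RL-Scope's operation scoping.
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)

@contextmanager
def scope(op):
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[op] += time.perf_counter() - start

for _ in range(100):                # one RL training iteration
    with scope("inference"):
        pass                        # action = policy(observation)
    with scope("simulation"):
        pass                        # observation, reward = env.step(action)
    with scope("backpropagation"):
        pass                        # loss.backward(); optimizer.step()

print(dict(totals))                 # per-operation time breakdown
```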
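A minimal sketch of GPU vectorization follows, using Numba CUDA and toy free-fall physics as assumed stand-ins for the simulators studied in the dissertation: each GPU thread advances one environment, so a batch of 1024 environments steps in parallel rather than sequentially on the CPU.

```python
# Hypothetical sketch of GPU vectorization: one GPU thread per
# simulator instance, so 1024 environments advance in parallel.
from numba import cuda
import numpy as np

@cuda.jit
def step_once(pos, vel, dt):
    i = cuda.grid(1)            # one thread per environment
    if i < pos.size:
        vel[i] -= 9.81 * dt     # toy physics: free fall
        pos[i] += vel[i] * dt

n_envs = 1024
pos = cuda.to_device(np.zeros(n_envs, dtype=np.float32))
vel = cuda.to_device(np.zeros(n_envs, dtype=np.float32))
for _ in range(100):            # one kernel launch per simulation step
    step_once.forall(n_envs)(pos, vel, np.float32(1e-3))
```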
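Building on the previous sketch, here is a hedged illustration of simulator kernel fusion: many simulation steps run inside a single kernel launch, so each thread's state stays in registers across steps instead of round-tripping through global memory between launches. The toy physics and parameters are assumptions, not the dissertation's simulator.

```python
# Hypothetical sketch of simulator kernel fusion: 100 simulation
# steps execute inside one kernel launch, with per-environment state
# held in registers (p, v) and written back to global memory once.
from numba import cuda
import numpy as np

@cuda.jit
def step_fused(pos, vel, dt, n_steps):
    i = cuda.grid(1)
    if i < pos.size:
        p = pos[i]                  # load state into registers once
        v = vel[i]
        for _ in range(n_steps):    # all steps run inside one launch
            v -= 9.81 * dt          # toy physics: free fall
            p += v * dt
        pos[i] = p                  # write back to global memory once
        vel[i] = v

n_envs = 1024
pos = cuda.to_device(np.zeros(n_envs, dtype=np.float32))
vel = cuda.to_device(np.zeros(n_envs, dtype=np.float32))
step_fused.forall(n_envs)(pos, vel, np.float32(1e-3), 100)
```

Compared with the unfused sketch, this removes the per-step kernel-launch overhead and the intermediate global-memory traffic between steps, which is the source of the speedup the abstract describes.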
