Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
ETH Zurich
European Conference on Computer Systems (EuroSys), 2024
@inproceedings{strati2024orion,
  title={Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications},
  author={Strati, Foteini and Ma, Xianzhe and Klimovic, Ana},
  booktitle={Proceedings of the European Conference on Computer Systems (EuroSys)},
  year={2024}
}
GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN) applications. However, DNN applications often underutilize GPUs, even when using large batch sizes and eliminating input data processing or communication stalls. DNN workloads consist of data-dependent operators with different compute and memory requirements. While an operator may saturate GPU compute units or memory bandwidth, it often leaves other GPU resources idle. Despite the prevalence of GPU sharing techniques, current approaches are neither sufficiently fine-grained nor sufficiently interference-aware to maximize GPU utilization while minimizing interference at the granularity of tens of microseconds. We propose Orion, a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU. Orion schedules work on the GPU at the granularity of individual operators and minimizes interference by taking into account each operator's compute and memory requirements. We integrate Orion in PyTorch and demonstrate its benefits in various DNN workload collocation use cases. Orion significantly improves tail latency compared to state-of-the-art baselines for a high-priority inference job, while collocating best-effort inference jobs to increase per-GPU request throughput by up to 7.3x, or collocating DNN training to save up to 1.49x in training costs compared to dedicated GPU allocation.
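To make the scheduling idea concrete, here is a minimal Python sketch of interference-aware co-scheduling. It is a hypothetical illustration, not Orion's actual implementation: the operator profiles, queue structure, and dispatch callback are all assumptions. It captures the core policy described in the abstract: tag each operator as compute- or memory-bound (e.g., from offline profiling), and submit a best-effort operator alongside a high-priority one only when their bottleneck resources differ.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum

class Profile(Enum):
    COMPUTE = "compute-bound"
    MEMORY = "memory-bound"

@dataclass
class Op:
    # One intercepted GPU operator; profile would come from offline profiling.
    name: str
    profile: Profile

def schedule_step(hp: deque, be: deque, dispatch) -> None:
    """One scheduler iteration over two client queues (names hypothetical).

    The high-priority operator is always dispatched. A best-effort operator
    is co-scheduled only if it stresses a different GPU resource than the
    high-priority operator, limiting interference between the two clients.
    """
    running = hp.popleft() if hp else None
    if running is not None:
        dispatch(running, stream="high-priority")
    if be and (running is None or be[0].profile != running.profile):
        dispatch(be.popleft(), stream="best-effort")

if __name__ == "__main__":
    # Toy workloads: a high-priority inference job and a best-effort job.
    hp = deque([Op("conv2d", Profile.COMPUTE), Op("embedding", Profile.MEMORY)])
    be = deque([Op("layernorm", Profile.MEMORY), Op("matmul", Profile.COMPUTE)])
    dispatch = lambda op, stream: print(f"{stream}: launch {op.name} ({op.profile.value})")
    while hp or be:
        schedule_step(hp, be, dispatch)
```

In this toy run, the memory-bound layernorm is co-scheduled with the compute-bound conv2d, and the compute-bound matmul with the memory-bound embedding, so each pair contends for different GPU resources; a real system would additionally enforce this at kernel-launch interception time on separate GPU streams.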
January 14, 2024 by hgpu