Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
ETH Zurich
European Conference on Computer Systems (EuroSys), 2024
@inproceedings{strati2024orion,
  title={Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications},
  author={Strati, Foteini and Ma, Xianzhe and Klimovic, Ana},
  booktitle={Proceedings of the European Conference on Computer Systems (EuroSys)},
  year={2024}
}
GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN) applications. However, DNN applications often underutilize GPUs, even when using large batch sizes and eliminating input data processing or communication stalls. DNN workloads consist of data-dependent operators with different compute and memory requirements. While an operator may saturate GPU compute units or memory bandwidth, it often leaves other GPU resources idle. Despite the prevalence of GPU sharing techniques, current approaches are neither sufficiently fine-grained nor sufficiently interference-aware to maximize GPU utilization while minimizing interference at the granularity of tens of microseconds. We propose Orion, a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU. Orion schedules work on the GPU at the granularity of individual operators and minimizes interference by taking into account each operator's compute and memory requirements. We integrate Orion in PyTorch and demonstrate its benefits in various DNN workload collocation use cases. Orion significantly improves tail latency compared to state-of-the-art baselines for a high-priority inference job, while collocating best-effort inference jobs to increase per-GPU request throughput by up to 7.3x, or collocating DNN training to save up to 1.49x in training costs compared to dedicated GPU allocation.
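To make the scheduling idea concrete, here is a minimal Python sketch of interference-aware co-scheduling. It is a hypothetical illustration, not Orion's actual implementation: the operator profiles, queue structure, and dispatch callback are all assumptions. It captures the core policy described in the abstract: tag each operator as compute- or memory-bound (e.g., from offline profiling), and submit a best-effort operator alongside a high-priority one only when their bottleneck resources differ.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum

class Profile(Enum):
    COMPUTE = "compute-bound"
    MEMORY = "memory-bound"

@dataclass
class Op:
    # One intercepted GPU operator; profile would come from offline profiling.
    name: str
    profile: Profile

def schedule_step(hp: deque, be: deque, dispatch) -> None:
    """One scheduler iteration over two client queues (names hypothetical).

    The high-priority operator is always dispatched. A best-effort operator
    is co-scheduled only if it stresses a different GPU resource than the
    high-priority operator, limiting interference between the two clients.
    """
    running = hp.popleft() if hp else None
    if running is not None:
        dispatch(running, stream="high-priority")
    if be and (running is None or be[0].profile != running.profile):
        dispatch(be.popleft(), stream="best-effort")

if __name__ == "__main__":
    # Toy workloads: a high-priority inference job and a best-effort job.
    hp = deque([Op("conv2d", Profile.COMPUTE), Op("embedding", Profile.MEMORY)])
    be = deque([Op("layernorm", Profile.MEMORY), Op("matmul", Profile.COMPUTE)])
    dispatch = lambda op, stream: print(f"{stream}: launch {op.name} ({op.profile.value})")
    while hp or be:
        schedule_step(hp, be, dispatch)
```

In this toy run, the memory-bound layernorm is co-scheduled with the compute-bound conv2d, and the compute-bound matmul with the memory-bound embedding, so each pair contends for different GPU resources; a real system would additionally enforce this at kernel-launch interception time on separate GPU streams.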
January 14, 2024 by hgpu