ML Inference Scheduling with Predictable Latency
Inria & Sorbonne University, Paris, France
Proceedings of the Middleware for Autonomous AIoT Systems in the Computing Continuum (MAIoT ’25), 2025
@incollection{zhao2025ml,
title={ML Inference Scheduling with Predictable Latency},
author={Zhao, Haidong and Georgantas, Nikolaos},
booktitle={Proceedings of the Middleware for Autonomous AIoT Systems in the Computing Continuum},
pages={25--30},
year={2025}
}
Machine learning (ML) inference serving systems schedule requests both to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization can compromise latency-sensitive scheduling, as concurrently scheduled tasks contend for GPU resources and thereby interfere with one another. Because this interference makes scheduling unpredictable, neglecting it may jeopardize SLO or deadline satisfaction. Existing interference prediction approaches, however, remain limited in several respects that restrict their usefulness for scheduling. First, they are often coarse-grained, ignoring runtime co-location dynamics and thus limiting their prediction accuracy. Second, they tend to rely on a static prediction model, which may not cope well with varying workload characteristics. In this work, we evaluate these limitations of existing interference prediction approaches and outline our ongoing work toward efficient ML inference scheduling.
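To make the problem concrete, below is a minimal Python sketch (not from the paper) of SLO-aware admission that consults an interference predictor before co-locating a request on a GPU. `Request`, `GPU`, `SOLO_LATENCY_MS`, `predict_slowdown`, and the fixed 0.25 slowdown factor are all illustrative assumptions. Note that `predict_slowdown` here is exactly the kind of static, coarse-grained model the abstract critiques: it ignores runtime co-location dynamics and workload characteristics.

```python
# Hypothetical sketch of interference-aware SLO admission. All names and
# numbers are illustrative; the paper's actual design may differ.

from dataclasses import dataclass, field


@dataclass
class Request:
    model: str       # model to run
    slo_ms: float    # latency SLO for this request


@dataclass
class GPU:
    gpu_id: int
    resident: list = field(default_factory=list)  # models currently co-located


# Assumed profiling data: solo (isolated) inference latency per model, in ms.
SOLO_LATENCY_MS = {"resnet50": 12.0, "bert-base": 28.0}


def predict_slowdown(model: str, co_located: list) -> float:
    """Toy static interference model: each co-located task adds a fixed
    slowdown factor. A fine-grained predictor would instead account for
    runtime co-location dynamics (kernel overlap, memory bandwidth, ...)."""
    return 1.0 + 0.25 * len(co_located)


def schedule(req: Request, gpus: list) -> GPU | None:
    """Place the request on the first GPU where the *predicted* co-located
    latency still meets the SLO; otherwise reject (or queue) it."""
    for gpu in gpus:
        predicted = SOLO_LATENCY_MS[req.model] * predict_slowdown(req.model, gpu.resident)
        if predicted <= req.slo_ms:
            gpu.resident.append(req.model)
            return gpu
    return None  # no placement meets the SLO under predicted interference


if __name__ == "__main__":
    gpus = [GPU(0), GPU(1)]
    placement = schedule(Request("resnet50", slo_ms=20.0), gpus)
    print("placed on GPU", placement.gpu_id if placement else None)
```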