Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters
Virginia Tech, USA
arXiv:2512.10271 [cs.DC], (11 Dec 2025)
@misc{dongare2025hybridlearningoptimizationbaseddynamic,
title={Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters},
author={Shruti Dongare and Redwan Ibne Seraj Khan and Hadeel Albahar and Nannan Zhao and Diego Melendez Maita and Ali R. Butt},
year={2025},
eprint={2512.10271},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2512.10271}
}
Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application characteristics pose major challenges for existing schedulers, which often rely on offline profiling or application-specific assumptions. We present RLTune, an application-agnostic reinforcement learning (RL)-based scheduling framework that dynamically prioritizes and allocates DL jobs on heterogeneous GPU clusters. RLTune integrates RL-driven prioritization with MILP-based job-to-node mapping to optimize system-wide objectives such as job completion time (JCT), queueing delay, and resource utilization. Trained on large-scale production traces from Microsoft Philly, Helios, and Alibaba, RLTune improves GPU utilization by up to 20%, reduces queueing delay by up to 81%, and shortens JCT by up to 70%. Unlike prior approaches, RLTune generalizes across diverse workloads without requiring per-job profiling, making it practical for cloud providers to deploy at scale for more efficient, fair, and sustainable DL workload management.
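To make the two-stage design concrete, the sketch below shows what an MILP job-to-node mapping stage of this kind could look like, assuming RL-assigned priority weights are already available. It is a minimal illustration, not the paper's actual formulation: the job names, GPU demands, node speed factors, penalty constant, and the use of the PuLP solver are all assumptions made for this example.

import pulp

# Hypothetical RL-assigned priority weights (higher = schedule sooner).
jobs = {"job_a": 3.0, "job_b": 1.5, "job_c": 2.0}
gpus_needed = {"job_a": 4, "job_b": 2, "job_c": 8}

# Heterogeneous node types: free GPU counts and a relative speed factor.
nodes = {"v100_node": {"free_gpus": 8, "speed": 1.0},
         "a100_node": {"free_gpus": 8, "speed": 2.0}}

# Estimated runtime of each job on each node type (base runtime / speed).
base_runtime = {"job_a": 100.0, "job_b": 40.0, "job_c": 200.0}
runtime = {(j, n): base_runtime[j] / nodes[n]["speed"]
           for j in jobs for n in nodes}

prob = pulp.LpProblem("job_to_node_mapping", pulp.LpMinimize)

# x[j, n] = 1 if job j is placed on node n in this scheduling round.
x = pulp.LpVariable.dicts("x", [(j, n) for j in jobs for n in nodes],
                          cat="Binary")

# Objective: minimize priority-weighted runtime, plus a penalty for each
# job left unplaced, so high-priority jobs are not starved in the queue.
BIG = 1000.0  # penalty scale for unplaced jobs -- a tuning assumption
prob += (pulp.lpSum(jobs[j] * runtime[j, n] * x[j, n]
                    for j in jobs for n in nodes)
         + pulp.lpSum(jobs[j] * BIG
                      * (1 - pulp.lpSum(x[j, n] for n in nodes))
                      for j in jobs))

# Each job is placed on at most one node per round.
for j in jobs:
    prob += pulp.lpSum(x[j, n] for n in nodes) <= 1

# Node GPU capacity must not be exceeded.
for n in nodes:
    prob += (pulp.lpSum(gpus_needed[j] * x[j, n] for j in jobs)
             <= nodes[n]["free_gpus"])

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for j in jobs:
    for n in nodes:
        if pulp.value(x[j, n]) == 1:
            print(f"{j} -> {n}")

In a setup like this, the RL policy would update the priority weights as queue and cluster state evolve, and the MILP would be re-solved each scheduling round; the trade-off between runtime-weighted placement and the unplaced-job penalty is what couples the two stages.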
December 15, 2025 by hgpu