KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

Guilin Zhang, Wulan Guo, Ziqi Tan, Qiang Guan, Hailong Jiang
Department of Engineering Management and Systems Engineering, George Washington University, USA
arXiv:2507.07932 [cs.DC], 10 Jul 2025

Autoscaling GPU inference workloads in Kubernetes remains challenging because default mechanisms such as the Horizontal Pod Autoscaler (HPA) are reactive and threshold-based, struggle under dynamic and bursty traffic patterns, and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler. KIScaler learns latency-aware and resource-efficient scaling policies entirely in simulation and is deployed directly without retraining. Experiments across four traffic patterns show that KIScaler improves average reward by 75.2%, reduces P95 latency by up to 6.7x compared with CPU baselines, and generalizes without retraining. Our work bridges the gap between reactive autoscaling and intelligent orchestration for scalable GPU-accelerated environments.
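For context on the reactive baseline the paper critiques: the Kubernetes HPA computes its replica count from a single documented ratio rule, desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue). The sketch below illustrates that rule in plain Python (the min/max clamp and the example numbers are illustrative, not taken from the paper):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         min_replicas: int = 1,
                         max_replicas: int = 10) -> int:
    """Kubernetes HPA core scaling rule:
    desired = ceil(current * currentMetric / targetMetric),
    clamped here to an illustrative [min_replicas, max_replicas] range.
    """
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# e.g. 3 replicas observing 90% average utilization against a 60% target
print(hpa_desired_replicas(3, 90.0, 60.0))  # -> 5
```

Because the rule only reacts to the current metric ratio, it lags behind bursty traffic and, as the abstract notes, does not incorporate GPU-level signals, which is the gap a learned policy like KIScaler targets.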
