
AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization

Zicong Ye, Kunming Zhang, Guoming Tang
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, China
arXiv:2508.01744 [cs.LG] (3 Aug 2025)

@misc{ye2025agftadaptivegpufrequency,
   title={AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization},
   author={Zicong Ye and Kunming Zhang and Guoming Tang},
   year={2025},
   eprint={2508.01744},
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2508.01744}
}


The explosive growth of interactive Large Language Models (LLMs) has placed unprecedented demands for low latency on cloud GPUs, forcing them into high-power modes and driving up energy costs. Real-time inference workloads exhibit significant dynamic volatility, which presents substantial energy-saving opportunities; however, traditional static or rule-based power management strategies struggle to exploit these opportunities without compromising peak performance. To address this challenge, we propose AGFT (An Adaptive GPU Frequency Tuner), a framework that employs online reinforcement learning to autonomously learn an optimal frequency tuning policy. By monitoring real-time features such as request load and latency, AGFT combines fine-grained frequency control for precise adjustments with intelligent action-space pruning for stable, efficient decision-making, yielding a robust, automated energy management solution. We evaluated AGFT comprehensively in an environment simulating realistic, fluctuating inference requests. The experimental results demonstrate that AGFT reduces GPU energy consumption by 44.3% while introducing a latency overhead of under 10%, which translates into an Energy-Delay Product (EDP) improvement of up to 40.3%. These results show that our framework can significantly enhance the energy efficiency and economic benefits of existing LLM inference clusters without compromising service quality.
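The abstract does not spell out the learning algorithm, so the following is only a minimal sketch of how such an online tuner could be wired together: an epsilon-greedy bandit over a discretized set of SM clocks, a negative-EDP reward with a latency-budget penalty, and NVML's locked-clocks API via pynvml. The clock list FREQS_MHZ, the constants EPSILON, LATENCY_SLO_S, and PRUNE_MARGIN, and the observe_latency stub are all illustrative placeholders, not AGFT's actual design.

```python
# Minimal sketch of an online GPU frequency tuner in the spirit of AGFT.
# The epsilon-greedy bandit, candidate clocks, reward shaping, and latency
# stub below are illustrative assumptions, not the paper's implementation.
import random
import time

import pynvml  # pip install nvidia-ml-py

FREQS_MHZ = [810, 990, 1170, 1350, 1530, 1710]  # hypothetical SM clock actions
EPSILON = 0.1         # exploration rate (assumed)
LATENCY_SLO_S = 0.25  # hypothetical latency budget per measurement window
PRUNE_MARGIN = 5.0    # illustrative pruning threshold on value estimates

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

q_values = {f: 0.0 for f in FREQS_MHZ}  # running mean of per-action reward
counts = {f: 0 for f in FREQS_MHZ}


def observe_latency():
    """Stub: in a real deployment this would come from serving-stack metrics."""
    return random.uniform(0.1, 0.3)


def measure_step(freq_mhz, window_s=1.0):
    """Lock the GPU to freq_mhz, then integrate power draw over one window."""
    # Requires administrative privileges and a Volta-or-newer GPU.
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, freq_mhz, freq_mhz)
    energy_j, t0 = 0.0, time.time()
    while time.time() - t0 < window_s:
        # nvmlDeviceGetPowerUsage returns milliwatts; 0.1 s sampling interval.
        energy_j += pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0 * 0.1
        time.sleep(0.1)
    return energy_j, observe_latency()


for step in range(1000):
    # Action-space pruning (simplified): once every clock has been tried,
    # drop actions whose value estimate trails the best by a wide margin.
    best = max(q_values, key=q_values.get)
    tried_all = all(counts[f] > 0 for f in FREQS_MHZ)
    candidates = [f for f in FREQS_MHZ
                  if not tried_all or q_values[f] >= q_values[best] - PRUNE_MARGIN]

    freq = random.choice(candidates) if random.random() < EPSILON else best
    energy_j, latency_s = measure_step(freq)

    # Reward is negative EDP, penalized when the latency budget is missed.
    reward = -energy_j * latency_s
    if latency_s > LATENCY_SLO_S:
        reward -= 100.0  # illustrative SLO-violation penalty

    counts[freq] += 1
    q_values[freq] += (reward - q_values[freq]) / counts[freq]
```

Using negative EDP as the reward folds the energy and latency objectives into one scalar, which matches the EDP metric the paper reports. In practice the latency signal would be read from the inference server rather than a random stub, and locking clocks needs the same privilege that `nvidia-smi -lgc` requires.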