Towards Efficient and Practical GPU Multitasking in the Era of LLM

Jiarong Xing, Yifan Qiao, Simon Mo, Xingqi Cui, Gur-Eyal Sela, Yang Zhou, Joseph Gonzalez, Ion Stoica

UC Berkeley

arXiv:2508.08448 [cs.OS], (11 Aug 2025)

DOI:10.48550/arXiv.2508.08448

@misc{xing2025efficientpracticalgpumultitasking,
  title={Towards Efficient and Practical GPU Multitasking in the Era of LLM},
  author={Jiarong Xing and Yifan Qiao and Simon Mo and Xingqi Cui and Gur-Eyal Sela and Yang Zhou and Joseph Gonzalez and Ion Stoica},
  year={2025},
  eprint={2508.08448},
  archivePrefix={arXiv},
  primaryClass={cs.OS},
  url={https://arxiv.org/abs/2508.08448}
}

Source code (package): kvcached: Elastic KV cache for dynamic GPU sharing and efficient multi-LLM inference

GPU singletasking is becoming increasingly inefficient and unsustainable as hardware capabilities grow and workloads diversify. We are now at an inflection point where GPUs must embrace multitasking, much like CPUs did decades ago, to meet the demands of modern AI workloads. In this work, we highlight the key requirements for GPU multitasking, examine prior efforts, and discuss why they fall short. To advance toward efficient and practical GPU multitasking, we envision a resource management layer, analogous to a CPU operating system, to handle various aspects of GPU resource management and sharing. We outline the challenges and potential solutions, and hope this paper inspires broader community efforts to build the next-generation GPU compute paradigm grounded in multitasking.

Tags: Computer science, CUDA, LLM, nVidia, nVidia A100, Package, Performance, PyTorch

August 24, 2025 by hgpu
