ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler
Huawei Hilbert Research Center (Dresden), Dresden, Germany
arXiv:2601.20755 [cs.SE], 28 Jan 2026 (v1)
@misc{zou2026profinfer,
  title={ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler},
  author={Bohua Zou and Debayan Roy and Dhimankumar Yogesh Airao and Weihao Xu and Binqi Sun and Yutao Liu and Haibo Chen},
  year={2026},
  eprint={2601.20755},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2601.20755}
}
As large language models (LLMs) move from research to production, understanding how inference engines behave in real time has become both essential and elusive. Unlike general-purpose engines such as ONNX Runtime, today's LLM inference systems offer little operator-level visibility, leaving developers blind to where time and resources go. Even basic questions — is this workload memory-bound or compute-bound? — often remain unanswered. To close this gap, we develop a fine-grained, non-intrusive profiling framework for modern LLM inference engines, exemplified by llama-cpp but applicable to similar runtime architectures. Built on extended Berkeley Packet Filter (eBPF) technology, our system dynamically attaches probes to runtime functions across multiple layers — without modifying or recompiling the source. It transforms collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends, exposing how dense inference, Mixture-of-Experts routing, and operator offloading behave in practice. With less than 4% runtime overhead and high profiling fidelity, our framework makes LLM inference both transparent and diagnosable, turning performance profiling into a practical tool for optimization, scheduling, and resource-aware deployment.
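The abstract describes probes attached dynamically to user-space runtime functions via eBPF, with no source changes or recompilation of the inference engine. As a rough illustration of that mechanism (not the paper's actual implementation), the sketch below uses the BCC toolkit to attach a uprobe/uretprobe pair to an exported llama.cpp function and collect a latency histogram; the shared-library path and the probed symbol (llama_decode) are assumptions chosen for illustration, and running it requires BCC and root privileges.

#!/usr/bin/env python3
# Illustrative sketch only: attach an eBPF uprobe/uretprobe pair to a function in a
# llama.cpp shared library and record a latency histogram, without modifying or
# recompiling the target binary.
from bcc import BPF
from time import sleep

bpf_text = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);   // per-thread entry timestamp
BPF_HISTOGRAM(lat_us);       // log2 histogram of call latency (microseconds)

int on_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();    // lower 32 bits: thread id
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int on_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;                             // missed the matching entry event
    u64 delta = bpf_ktime_get_ns() - *tsp;
    lat_us.increment(bpf_log2l(delta / 1000));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=bpf_text)
# Assumed install path and symbol name; replace with the runtime's actual
# shared object and the function you want to probe.
lib = "/usr/local/lib/libllama.so"
b.attach_uprobe(name=lib, sym="llama_decode", fn_name="on_entry")
b.attach_uretprobe(name=lib, sym="llama_decode", fn_name="on_return")

print("Tracing llama_decode latency... hit Ctrl-C to print the histogram.")
try:
    sleep(999999)
except KeyboardInterrupt:
    pass
b["lat_us"].print_log2_hist("usecs")

A per-operator profiler such as ProfInfer would go well beyond this single-symbol latency view, correlating many probe points with hardware counters and graph structure, but the same attach-without-recompiling principle applies.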
February 2, 2026 by hgpu