{"id":30518,"date":"2026-02-02T00:31:41","date_gmt":"2026-02-01T22:31:41","guid":{"rendered":"https:\/\/hgpu.org\/?p=30518"},"modified":"2026-02-02T22:36:16","modified_gmt":"2026-02-02T20:36:16","slug":"profinfer-an-ebpf-based-fine-grained-llm-inference-profiler","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=30518","title":{"rendered":"ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler"},"content":{"rendered":"<p>As large language models (LLMs) move from research to production, understanding how inference engines behave in real time has become both essential and elusive. Unlike general-purpose engines such as ONNX Runtime, today&#8217;s LLM inference systems offer little operator-level visibility, leaving developers blind to where time and resources go. Even basic questions &#8212; is this workload memory-bound or compute-bound? &#8212; often remain unanswered. To close this gap, we develop a fine-grained, non-intrusive profiling framework for modern LLM inference engines, exemplified by llama-cpp but applicable to similar runtime architectures. Built on extended Berkeley Packet Filter (eBPF) technology, our system dynamically attaches probes to runtime functions across multiple layers &#8212; without modifying or recompiling the source. It transforms collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends, exposing how dense inference, Mixture-of-Experts routing, and operator offloading behave in practice. With less than 4% runtime overhead and high profiling fidelity, our framework makes LLM inference both transparent and diagnosable, turning performance profiling into a practical tool for optimization, scheduling, and resource-aware deployment.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As large language models (LLMs) move from research to production, understanding how inference engines behave in real time has become both essential and elusive. Unlike general-purpose engines such as ONNX Runtime, today&#8217;s LLM inference systems offer little operator-level visibility, leaving developers blind to where time and resources go. Even basic questions &#8212; is this workload [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[11,90,3],"tags":[1782,2155,1793,67],"class_list":["post-30518","post","type-post","status-publish","format-standard","hentry","category-computer-science","category-opencl","category-paper","tag-computer-science","tag-llm","tag-opencl","tag-performance"],"views":1130,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/30518","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=30518"}],"version-history":[{"count":1,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/30518\/revisions"}],"predecessor-version":[{"id":30523,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/30518\/revisions\/30523"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=30518"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=30518"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=30518"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}