
Compiler and Runtime Systems for Generative AI Models

Zihao Ye
University of Washington, 2025

@phdthesis{ye2025compiler,
   title  = {Compiler and Runtime Systems for Generative AI Models},
   author = {Ye, Zihao},
   year   = {2025},
   school = {University of Washington}
}

Generative AI (GenAI) workloads have rapidly become the predominant data center GPU workload. However, designing efficient GPU kernels for GenAI presents significant challenges due to two central factors: (1) GenAI workloads are intrinsically dynamic, featuring variable sequence lengths and irregular sparsity patterns, and (2) they evolve at a rapid pace, with shifting model architectures and changing deployment requirements. This dissertation addresses these challenges through a co-design approach spanning both the compiler and runtime layers, presenting two complementary systems that together enable efficient GenAI acceleration.

SparseTIR is a tensor compiler designed specifically for sparse deep learning workloads. While sparsity is pervasive in GenAI models, developing high-performance sparse GPU kernels remains difficult due to heterogeneous sparsity patterns and their distinct optimization requirements. SparseTIR introduces composable abstractions for both data formats and scheduling transformations, enabling complex optimization strategies with significantly less code. It achieves performance competitive with hand-optimized libraries while improving modularity and developer productivity.

FlashInfer is a fast, adaptable attention engine for large language model (LLM) inference. As attention increasingly dominates the computational cost of modern GenAI models, scalable and customizable GPU kernels become essential. FlashInfer supports block-sparse KV-cache layouts, just-in-time (JIT) compilation of parameterized attention templates, and dynamic load-balancing mechanisms compatible with CUDA Graphs. Building on this foundation, we are developing megakernels for low-latency and multiplexed inference scenarios. As an open-source project, FlashInfer has pioneered LLM inference kernel development and was among the first to explore techniques such as split-KV, GQA packing, and cascade inference. It has been deployed at scale in production and has fostered a vibrant community across academia and industry.

Together, these systems form a cohesive framework for accelerating GenAI workloads through integrated compiler-runtime co-design. They demonstrate how principled systems approaches can achieve both high performance and adaptability in the face of rapidly evolving machine learning demands, providing a foundation for future GenAI system development.
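To make the composable-formats idea concrete, the following is a minimal, library-agnostic Python sketch using NumPy and SciPy, not SparseTIR's actual API: a sparse matrix is split into a block-sparse (BSR) part, whose dense blocks suit block-structured kernels, plus a CSR residual, and an SpMM is then composed from the two pieces. The decompose helper and its density_thresh parameter are illustrative names.

import numpy as np
import scipy.sparse as sp

def decompose(A, b=2, density_thresh=1.0):
    """Split A into a BSR part (blocks dense enough for block kernels)
    plus a CSR residual holding everything else."""
    bsr = sp.bsr_matrix(A, blocksize=(b, b))
    nnz_per_block = (bsr.data != 0).sum(axis=(1, 2))   # nonzeros per stored block
    keep = nnz_per_block >= density_thresh * b * b     # blocks worth keeping in BSR
    data = bsr.data.copy()
    data[~keep] = 0                                    # evict sparse blocks from BSR part
    bsr_part = sp.bsr_matrix((data, bsr.indices, bsr.indptr),
                             shape=A.shape, blocksize=(b, b))
    csr_part = sp.csr_matrix(A) - bsr_part.tocsr()     # residual entries
    return bsr_part, csr_part

rng = np.random.default_rng(0)
A = sp.random(8, 8, density=0.4, random_state=0, format="csr")
B, R = decompose(A, b=2)
x = rng.standard_normal((8, 4))
assert np.allclose(A @ x, B @ x + R @ x)               # composed SpMM == original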
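The block-sparse KV-cache layout can likewise be illustrated with a small NumPy sketch of single-query attention over a paged cache. This is a schematic of the data layout rather than FlashInfer's API; kv_pages and page_table are hypothetical names. Each sequence's keys and values live in fixed-size pages scattered through a shared pool, and a per-sequence page table maps logical positions to physical pages.

import numpy as np

def paged_attention(q, kv_pages, page_table, seq_len, page_size=4):
    """Single-query attention over a KV cache stored in fixed-size pages.
    q: (d,); kv_pages: (num_pages, 2, page_size, d);
    page_table: logical -> physical page ids for one sequence."""
    n_pages = -(-seq_len // page_size)                    # ceil division
    k = kv_pages[page_table[:n_pages], 0].reshape(-1, q.shape[0])[:seq_len]
    v = kv_pages[page_table[:n_pages], 1].reshape(-1, q.shape[0])[:seq_len]
    s = k @ q / np.sqrt(q.shape[0])                       # attention scores
    p = np.exp(s - s.max()); p /= p.sum()                 # stable softmax
    return p @ v

d, page_size, num_pages = 8, 4, 16
rng = np.random.default_rng(0)
kv_pages = rng.standard_normal((num_pages, 2, page_size, d))
page_table = np.array([5, 2, 9])                          # non-contiguous pages
out = paged_attention(rng.standard_normal(d), kv_pages, page_table, seq_len=10)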
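Finally, split-KV and cascade inference both hinge on a merge operator for partial attention states: attention over disjoint KV chunks can be computed independently and combined exactly using each chunk's log-sum-exp. Below is a minimal NumPy sketch of that operator, illustrative rather than the dissertation's code.

import numpy as np

def attn_state(q, k, v):
    """Partial attention over one KV chunk: returns (output, log-sum-exp)."""
    s = k @ q / np.sqrt(q.shape[0])
    m = s.max()
    lse = m + np.log(np.exp(s - m).sum())                 # stable log-sum-exp
    return np.exp(s - lse) @ v, lse

def merge(o1, lse1, o2, lse2):
    """Combine two partial states; the softmax weights fall out of the LSEs."""
    lse = np.logaddexp(lse1, lse2)
    return np.exp(lse1 - lse) * o1 + np.exp(lse2 - lse) * o2, lse

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k, v = rng.standard_normal((32, 8)), rng.standard_normal((32, 8))
o_full, _ = attn_state(q, k, v)                           # one pass over all KV
o1, l1 = attn_state(q, k[:20], v[:20])                    # chunk 1
o2, l2 = attn_state(q, k[20:], v[20:])                    # chunk 2
o_merged, _ = merge(o1, l1, o2, l2)
assert np.allclose(o_full, o_merged)                      # merge is exact

Because the merge is exact, split-KV can parallelize long-context decoding across KV chunks, and cascade inference can compute attention over a shared prefix once and merge it with each request's own state.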