FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

Sina Heidari, Dimitrios S. Nikolopoulos
Virginia Tech, Blacksburg, Virginia, USA
arXiv:2604.26666 [cs.DC] (29 Apr 2026)

@misc{heidari2026fact,
   title={FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow},
   author={Sina Heidari and Dimitrios S. Nikolopoulos},
   year={2026},
   eprint={2604.26666},
   archivePrefix={arXiv},
   primaryClass={cs.DC},
   url={https://arxiv.org/abs/2604.26666}
}


Deep learning compilers and vendor libraries deliver strong baseline performance but are bounded by finite, engineer-curated catalogs. When these catalogs omit a needed optimization, practitioners substitute hand-written CUDA or CUTLASS code, which demands expertise in GPU microarchitecture and C++ template metaprogramming. Recent LLM-based agents target kernel generation in raw CUDA, forcing rediscovery of optimizations already encoded in mature libraries. We present FACT (Framework for Agentic CUTLASS Transpilation), which optimizes PyTorch modules through multi-pattern composition while grounding synthesis in CUTLASS C++, using a three-stage, agent-driven workflow. (1) Pattern discovery: an LLM agent inspects the traced graph, matches subgraphs to optimization rules, retrieves vetted examples from an architecture-specific index, and outputs prioritized patterns. (2) Pattern realization: each pattern is implemented as a CUTLASS kernel wrapped in a PyTorch extension, verified, and auto-tuned by sweeping parameters inferred from the CUTLASS hierarchy. (3) Pattern composition: the extensions are loaded together into a single composed module for end-to-end benchmarking. We evaluate the workflow on an NVIDIA A100 using KernelBench's evaluation framework and provided modules. On Level 1, we apply the workflow to three GEMM workloads (square matrix multiply, batched matrix multiply, and large-K matrix multiply); the auto-tuned CUTLASS kernels improve over the PyTorch cuBLAS baseline by 1.06x-1.18x. On the Level 3 MiniGPT block, composing fused multi-head attention with a fused MLP GEMM+GELU yields a 2.79x end-to-end speedup. Our work couples agentic graph-level pattern discovery with auto-tuning and a dynamic pattern table, offering a practical path from traced PyTorch to deployable kernels by automating CUTLASS kernel synthesis and tuning.
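To make the stage-2 "pattern realization" artifact concrete, here is a minimal sketch, not taken from the paper, of what such a generated file could look like: a CUTLASS 2.x device-level GEMM exposed as a PyTorch C++ extension. The function name, data types, layouts, and tile shapes below are illustrative assumptions; the threadblock/warp/instruction shapes are the CUTLASS-hierarchy parameters the abstract says an auto-tuner sweeps.

// Sketch only (not the authors' code): a CUTLASS fp16 GEMM wrapped as a
// PyTorch extension, in the shape a stage-2 agent might emit for an A100.
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#include <cutlass/gemm/device/gemm.h>

// Tile shapes are the tuning knobs: a sweep would regenerate this type
// with different threadblock/warp shapes and keep the fastest variant.
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // A (M x K)
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B, stored as (N x K) weight
    cutlass::half_t, cutlass::layout::RowMajor,     // C / D (M x N)
    float,                                          // accumulate in fp32
    cutlass::arch::OpClassTensorOp,                 // Tensor Cores
    cutlass::arch::Sm80,                            // A100
    cutlass::gemm::GemmShape<128, 128, 32>,         // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 32>,           // warp tile
    cutlass::gemm::GemmShape<16, 8, 16>>;           // mma instruction

torch::Tensor gemm_fp16(torch::Tensor A, torch::Tensor B) {
  TORCH_CHECK(A.is_cuda() && B.is_cuda(), "expected CUDA tensors");
  TORCH_CHECK(A.is_contiguous() && B.is_contiguous(), "expected contiguous tensors");
  int M = A.size(0), K = A.size(1), N = B.size(0);  // B is (N, K), like nn.Linear weight
  auto C = torch::empty({M, N}, A.options());

  // Default CUTLASS alignment for fp16 Tensor Core GEMM is 8 elements,
  // so this sketch assumes K and N are multiples of 8.
  Gemm gemm_op;
  Gemm::Arguments args(
      {M, N, K},
      {reinterpret_cast<cutlass::half_t const*>(A.data_ptr<at::Half>()), K},
      {reinterpret_cast<cutlass::half_t const*>(B.data_ptr<at::Half>()), K},
      {reinterpret_cast<cutlass::half_t*>(C.data_ptr<at::Half>()), N},
      {reinterpret_cast<cutlass::half_t*>(C.data_ptr<at::Half>()), N},
      {1.0f, 0.0f});  // D = 1 * A @ B^T + 0 * C
  cutlass::Status status =
      gemm_op(args, /*workspace=*/nullptr, at::cuda::getCurrentCUDAStream());
  TORCH_CHECK(status == cutlass::Status::kSuccess, "CUTLASS GEMM failed");
  return C;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("gemm_fp16", &gemm_fp16, "CUTLASS fp16 GEMM (illustrative sketch)");
}

In the workflow the abstract describes, an auto-tuner would rebuild this extension for each candidate tile configuration, verify the output against the PyTorch reference, benchmark, and retain the fastest correct variant; stage 3 would then load the winning extensions together into one composed module.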