high performance computing on graphics processing units: hgpu.org

A Human–Machine Collaborative Tuning Framework for Triton Kernel Optimization on SIMD Platforms

Xulin Zhou, Hongbin Zhang, Mingjie Xing

Institute of Software, Chinese Academy of Sciences, Beijing, China

@article{zhou2026human,
  title={A Human–Machine Collaborative Tuning Framework for Triton Kernel Optimization on SIMD Platforms},
  author={Zhou, Xulin and Zhang, Hongbin and Xing, Mingjie},
  year={2026}
}

Single Instruction, Multiple Data (SIMD) technology enhances performance through parallel data processing on CPUs. SIMD platforms are widely adopted across domains ranging from high-performance computing to AI inference. As modern AI workloads increasingly rely on Python-based kernel frameworks to maintain usability and benefit from automatic tuning, Triton has emerged as a representative solution. However, Triton’s autotuning mechanism, designed primarily for NVIDIA GPUs, fails to effectively exploit the architectural features of SIMD CPUs, creating a significant performance gap on these platforms. To address this problem, we introduce a human–machine collaborative design tailored for Triton kernel tuning on SIMD platforms. This design improves both development efficiency and performance by capturing high-level SIMD optimization intent from human users and integrating it seamlessly into machine framework tuning. Based on this collaborative design, we develop a tuning framework composed of a front-end for user intent recognition and a back-end for user-guided, SIMD-aware tuning. Experiments on x86 and RISC-V platforms show an average performance improvement of 31.7% over native Triton tuning, with tuning cost reduced by up to 75.0%.
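
To make the idea of user-guided tuning concrete, here is a minimal sketch of how a high-level hint could shrink an autotuning search space before any benchmarking happens. It is not the framework described in the abstract and does not use Triton's API. The parameter names (`BLOCK_M`, `BLOCK_N`, `BLOCK_K`), the hint fields (`vector_lanes`, `max_tile_elems`), and the `toy_cost` function are all hypothetical, chosen only for illustration; a real tuner would replace `toy_cost` with timed kernel runs.

```python
from itertools import product

# Hypothetical search space, loosely modeled on block-size parameters a
# kernel might expose. The names and values are illustrative assumptions.
SEARCH_SPACE = {
    "BLOCK_M": [8, 16, 32, 64],
    "BLOCK_N": [8, 16, 32, 64],
    "BLOCK_K": [8, 16, 32, 64],
}


def all_configs(space):
    """Enumerate every combination in the search space (exhaustive tuning)."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def apply_hint(configs, vector_lanes, max_tile_elems):
    """Keep only configs consistent with a user-supplied SIMD hint.

    vector_lanes:   elements per SIMD register (assumed to be known by the user).
    max_tile_elems: cap on BLOCK_M * BLOCK_N, a stand-in for register pressure.
    """
    return [
        c for c in configs
        if c["BLOCK_N"] % vector_lanes == 0
        and c["BLOCK_M"] * c["BLOCK_N"] <= max_tile_elems
    ]


def toy_cost(config):
    """Placeholder for a real benchmark run; this is NOT a performance model."""
    return (
        abs(config["BLOCK_M"] - 16)
        + abs(config["BLOCK_N"] - 32)
        + abs(config["BLOCK_K"] - 16)
    )


def tune(configs, cost_fn):
    """Return the lowest-cost config and the number of evaluations spent."""
    best = min(configs, key=cost_fn)
    return best, len(configs)


full = all_configs(SEARCH_SPACE)
guided = apply_hint(full, vector_lanes=16, max_tile_elems=1024)

best_full, evals_full = tune(full, toy_cost)
best_guided, evals_guided = tune(guided, toy_cost)

print("exhaustive:", evals_full, "evaluations ->", best_full)
print("guided:    ", evals_guided, "evaluations ->", best_guided)
```

In this toy setup the hint removes candidates whose inner block size is not a multiple of the assumed vector width, so fewer configurations need to be evaluated. Whether a hint preserves the best configuration depends entirely on the hint's quality; the sketch makes no claim about the savings reported above.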

Tags: Auto-Tuning, Computer science, Evolutionary Computations, Triton

May 3, 2026 by hgpu
