Posts
Nov 9
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code generation. Yet existing methods for automatic kernel generation often produce low-efficiency kernels, incur high computational overhead, and fail to generalize across settings. In […]
Nov 9
RDMA Point-to-Point Communication for LLM Systems
Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs […]
Nov 9
AMD MI300X GPU Performance Analysis
The rapid growth of large language models (LLMs) has driven the need for high-performance, scalable GPU hardware capable of efficiently serving models with hundreds of billions of parameters. While NVIDIA GPUs have traditionally dominated LLM deployments due to their mature CUDA software stack and state-of-the-art accelerators, AMD’s latest MI300X GPUs offer a compelling alternative, […]
Nov 9
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8’s theoretical efficiency. However, naively removing these casts by keeping […]
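The "double quantization error" in the title is easy to reproduce in miniature: when a dataflow quantizes a tensor, dequantizes it into a high-precision domain for an operator, and quantizes the result again, rounding error is injected twice. A toy numpy sketch using symmetric int8-style fake quantization (a simplification for illustration, not the paper's FP8 recipe):

```python
import numpy as np

def fake_quant(x, bits=8):
    """Symmetric per-tensor fake quantization: round onto a uniform
    2^bits-level grid and map straight back to float."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(1 << 16).astype(np.float32)
ref = np.tanh(x)

# Cast-free style: quantize once, after the operator.
one_pass = fake_quant(ref)

# Q/DQ-heavy style: quantize the input, dequantize into a high-precision
# domain for the operator, then quantize the result again, so rounding
# error is injected twice and compounds.
two_pass = fake_quant(np.tanh(fake_quant(x)))

print("single quantization error:", np.abs(ref - one_pass).mean())
print("double quantization error:", np.abs(ref - two_pass).mean())
```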
Nov 9
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs
Different compilers can generate code with notably different performance characteristics – even on the same system. Today, GPU developers have three popular options for compiling CUDA or HIP code for GPUs. First, CUDA code can be compiled by either NVCC or Clang for NVIDIA GPUs. Alternatively, AMD’s recently introduced HIP platform makes porting from CUDA […]
Nov 2
Scalable GPU-Based Integrity Verification for Large Machine Learning Models
We present a security framework that strengthens distributed machine learning by standardizing integrity protections across CPU and GPU platforms and significantly reducing verification overheads. Our approach co-locates integrity verification directly with large ML model execution on GPU accelerators, resolving the fundamental mismatch between how large ML workloads typically run (primarily on GPUs) and how security […]
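For context, the CPU-side baseline such systems start from is straightforward: hash every weight shard and compare against a trusted manifest at load time. A minimal sketch (the two-level digest layout here is a hypothetical scheme for illustration, not the paper's GPU-resident method):

```python
import hashlib
import numpy as np

def weights_digest(shards):
    # Hash each shard, then hash the concatenated per-shard digests so a
    # single root value can be pinned in a signed manifest.
    leaves = [hashlib.sha256(s.tobytes()).digest() for s in shards]
    return hashlib.sha256(b"".join(leaves)).hexdigest()

# Stand-in for sharded model weights.
shards = [np.random.default_rng(i).standard_normal(1 << 20).astype(np.float32)
          for i in range(4)]

manifest = weights_digest(shards)          # computed at publish time
assert weights_digest(shards) == manifest  # recomputed at load time
```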
Nov 2
Serve Programs, Not Prompts
Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to […]
Nov 2
Enhancing Transformer Performance and Portability through Auto-tuning Frameworks
Transformer-based models such as BERT and GPT-2 have become the foundation of many modern applications, yet their execution requires substantial computational and memory resources. To address these challenges, recent advances in compiler technology and hardware accelerators have introduced new opportunities for performance portability. In this work, we evaluate JAX and TVM as high-level frameworks that […]
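To give a flavor of the JAX half of that comparison: a transformer-style operator needs only a decorator to be traced and compiled through XLA for whatever backend is available. A minimal sketch, not taken from the paper:

```python
import jax
import jax.numpy as jnp

@jax.jit  # traced once, then compiled by XLA for the available backend
def attention_scores(q, k):
    # Scaled dot-product attention scores, the kind of op
    # auto-tuning frameworks are judged on.
    return jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)

q = jnp.ones((128, 64))
k = jnp.ones((128, 64))
print(attention_scores(q, k).shape)  # (128, 128)
```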
Nov 2
A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations
Deep learning (DL) now plays a significant role in numerous fields, making it crucial to ensure the stability of both training and inference in DL systems. The computation of DL models can be viewed as the execution of a series of DL operators, which are essential components that perform the core numerical computations. Therefore, […]
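One operator-level precision decision with outsized impact is the accumulator width of a reduction. A small numpy illustration (not from the paper): summing 20,000 copies of 2^-10 in an fp16 accumulator stalls at 2.0, because from that point on each addend is exactly half the fp16 grid spacing and ties round back to even.

```python
import numpy as np

# 20,000 copies of 2**-10 (exactly representable in fp16); true sum is 19.53125.
x = np.full(20_000, 2.0 ** -10, dtype=np.float16)

# Accumulate in the operator's own precision: once the running sum reaches
# 2.0, every further addition rounds away and the sum never moves again.
acc = np.float16(0.0)
for v in x:
    acc = np.float16(acc + v)

print("fp16 accumulator:", float(acc))                         # 2.0
print("fp32 accumulator:", float(x.astype(np.float32).sum()))  # 19.53125
```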
Nov 2
INT vs. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Modern AI hardware, such as Nvidia’s Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper […]
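The intuition behind the FP-versus-INT question shows up with a per-tensor scale and a few outliers: the uniform INT8 step is stretched by the largest value, while FP8's relative spacing keeps the bulk of small values precise. A rough numpy comparison (the E4M3 grid construction is a simplification for illustration, not the paper's methodology):

```python
import numpy as np

def int8_quant(x):
    # Uniform symmetric grid: the step size is set by the largest outlier.
    scale = np.abs(x).max() / 127
    return np.clip(np.round(x / scale), -127, 127) * scale

def e4m3_quant(x):
    # Round to the nearest representable FP8 E4M3 value (finite max 448):
    # precision is relative, so small values keep fine resolution even
    # when the tensor contains large outliers.
    vals = sorted({(m / 8) * 2.0 ** -6 if e == 0 else (1 + m / 8) * 2.0 ** (e - 7)
                   for e in range(16) for m in range(8)})
    grid = np.array([v for v in vals if v <= 448.0])
    grid = np.concatenate([-grid[::-1], grid])
    scale = np.abs(x).max() / 448.0
    idx = np.abs(x[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
x[:4] = 80.0  # a few activation outliers

print("int8 error on bulk:", np.abs(x[4:] - int8_quant(x)[4:]).mean())
print("e4m3 error on bulk:", np.abs(x[4:] - e4m3_quant(x)[4:]).mean())
```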
Oct 26
Collective Communication for 100k+ GPUs
The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. Traditional communication methods face significant throughput and latency limitations at this scale, hindering both the development and deployment of state-of-the-art models. This paper presents the NCCLX collective communication framework, developed […]
Oct 26
A Compute Graph Simulation and Implementation Framework Targeting AMD Versal AI Engines
We present a framework for developing compute graph-based applications targeting the AI Engine (AIE) array of AMD Versal SoCs. This framework enables users to embed AIE-based dataflow graph prototypes directly within existing C++ applications and automatically transform them into deployable AIE graph projects. It thereby eliminates the need to manually separate host and accelerator codebases, […]

