Posts
Nov 9
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs
Different compilers can generate code with notably different performance characteristics – even on the same system. Today, GPU developers have three popular options for compiling CUDA or HIP code for GPUs. First, CUDA code can be compiled by either NVCC or Clang for NVIDIA GPUs. Alternatively, AMD’s recently introduced HIP platform makes porting from CUDA […]
Nov 9
RDMA Point-to-Point Communication for LLM Systems
Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs […]
Nov 9
AMD MI300X GPU Performance Analysis
The rapid growth of large language models (LLMs) has driven the need for high-performance, scalable GPU hardware capable of efficiently serving models with hundreds of billions of parameters. While NVIDIA GPUs have traditionally dominated LLM deployments due to their mature CUDA software stack and state-of-the-art accelerators, AMD’s latest MI300X GPUs offer a compelling alternative, […]
Nov 2
Scalable GPU-Based Integrity Verification for Large Machine Learning Models
We present a security framework that strengthens distributed machine learning by standardizing integrity protections across CPU and GPU platforms and significantly reducing verification overheads. Our approach co-locates integrity verification directly with large ML model execution on GPU accelerators, resolving the fundamental mismatch between how large ML workloads typically run (primarily on GPUs) and how security […]
Nov 2
Serve Programs, Not Prompts
Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to […]
Nov 2
Enhancing Transformer Performance and Portability through Auto-tuning Frameworks
Transformer-based models such as BERT and GPT-2 have become the foundation of many modern applications, yet their execution requires substantial computational and memory resources. To address these challenges, recent advances in compiler technology and hardware accelerators have introduced new opportunities for performance portability. In this work, we evaluate JAX and TVM as high-level frameworks that […]
Nov 2
A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations
Deep learning (DL) has already played a significant role in numerous fields, making it crucial to ensure the stability of both training and inference in DL systems. The computation of DL models can be viewed as the execution of a series of DL operators, which are essential components that perform the core numerical computations. Therefore, […]
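As a minimal, self-contained illustration of why operator-level precision matters (a toy example, not taken from the paper): the same softmax operator is stable or unstable in float16 depending on whether the standard max-shift rewrite is applied.

```python
import numpy as np

def softmax_naive(x):
    # Direct definition: exp(x) / sum(exp(x)).
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Shift-invariant form: subtracting the max avoids overflow.
    e = np.exp(x - x.max())
    return e / e.sum()

# exp(22) overflows float16 (max ~65504), so the naive form yields inf/inf = nan.
x16 = np.array([20, 21, 22], dtype=np.float16)
print(softmax_naive(x16))   # contains nan
print(softmax_stable(x16))  # well-defined probabilities
```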
Nov 2
INT vs FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Modern AI hardware, such as Nvidia’s Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper […]
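A rough, hypothetical sketch of the trade-off at stake (not the paper's methodology): with one activation outlier, a symmetric INT4 quantizer spends its whole range on the outlier, while an FP4-style E2M1 value grid keeps some resolution near zero.

```python
import numpy as np

def quant_int4(x):
    # Per-tensor symmetric INT4: map [-max|x|, max|x|] onto integers [-7, 7].
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -7, 7) * scale

def quant_fp4(x):
    # FP4-like E2M1: snap magnitudes to the representable grid {0, 0.5, ..., 6}.
    grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    scale = np.abs(x).max() / 6.0  # largest value maps to 6
    mags = np.abs(x) / scale
    snapped = grid[np.argmin(np.abs(mags[:, None] - grid[None, :]), axis=1)]
    return np.sign(x) * snapped * scale

x = np.array([0.01, -0.02, 0.05, 0.03, -0.04, 1.0])  # one outlier
err_int = np.abs(quant_int4(x) - x).mean()
err_fp = np.abs(quant_fp4(x) - x).mean()
print(f"mean abs error  INT4: {err_int:.4f}  FP4: {err_fp:.4f}")
```

Here INT4 rounds every small value to zero, while the FP4 grid preserves one of them; the paper's point is that such comparisons only become meaningful once granularity is controlled for.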
Oct 26
Collective Communication for 100k+ GPUs
The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. Traditional communication methods face significant throughput and latency limitations at this scale, hindering both the development and deployment of state-of-the-art models. This paper presents the NCCLX collective communication framework, developed […]
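For readers new to collectives, the bandwidth-optimal ring all-reduce that libraries like NCCL build on can be simulated in a few lines (a single-process sketch for intuition, not NCCLX's actual implementation):

```python
import numpy as np

def ring_allreduce(data):
    """data[r] is rank r's local vector; returns each rank's reduced copy."""
    n = len(data)
    chunks = [list(np.array_split(np.asarray(d, float), n)) for d in data]
    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n to
    # rank r+1. After n-1 steps rank r holds the fully reduced chunk (r+1) mod n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            chunks[(r + 1) % n][c] = chunks[(r + 1) % n][c] + chunks[r][c]
    # Phase 2, all-gather: circulate each finished chunk once around the ring.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            chunks[(r + 1) % n][c] = chunks[r][c]
    return [np.concatenate(ck) for ck in chunks]

ranks = [np.arange(6) * (r + 1) for r in range(4)]  # 4 "GPUs", 6 elements each
out = ring_allreduce(ranks)
print(out[0])  # every rank ends with the elementwise sum
```

Each rank sends only 2(n-1)/n of the data regardless of ring size, which is why the pattern scales; the engineering challenge the post describes is keeping that property at 100k+ endpoints.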
Oct 26
A Compute Graph Simulation and Implementation Framework Targeting AMD Versal AI Engines
We present a framework for developing compute graph-based applications targeting the AI Engine (AIE) array of AMD Versal SoCs. This framework enables users to embed AIE-based dataflow graph prototypes directly within existing C++ applications and automatically transform them into deployable AIE graph projects. It thereby eliminates the need to manually separate host and accelerator codebases, […]
Oct 26
STARK: Strategic Team of Agents for Refining Kernels
The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific characteristics. While recent advances in large language models (LLMs) provide new opportunities for automated code generation, existing approaches largely treat LLMs as […]
Oct 26
Architecting Tensor Core-Based Reductions for Irregular Molecular Docking Kernels
Tensor Cores (TCs) are specialized hardware units designed for efficient matrix multiplication and are widely utilized in deep learning workloads. However, their adoption in more irregular high-performance computing (HPC) applications remains limited. This paper presents a methodology for effectively integrating TCs into a representative HPC application: molecular docking with AutoDock-GPU. The irregular computational patterns and […]
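The core trick this line of work relies on can be shown in miniature: a reduction is recast as a matrix multiply, the one operation Tensor Cores accelerate (a toy NumPy sketch, not the paper's kernels; on hardware the `@` would be an MMA instruction):

```python
import numpy as np

# 3 segments of 4 values each; summing each segment is a reduction.
vals = np.arange(12, dtype=np.float32).reshape(3, 4)
ones = np.ones((4, 1), dtype=np.float32)

# Multiplying by a column of ones reduces every segment in one matmul,
# replacing 3 serial accumulation loops with a single matrix operation.
sums = vals @ ones
print(sums.ravel())  # segment sums: 6, 22, 38
```

The hard part the paper addresses is that irregular workloads like docking do not arrive in such tidy rectangular segments, so data must be packed to fit the MMA shape.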

