Posts

Nov 2

Scalable GPU-Based Integrity Verification for Large Machine Learning Models

We present a security framework that strengthens distributed machine learning by standardizing integrity protections across CPU and GPU platforms and significantly reducing verification overheads. Our approach co-locates integrity verification directly with large ML model execution on GPU accelerators, resolving the fundamental mismatch between how large ML workloads typically run (primarily on GPUs) and how security […]
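
As a rough illustration of verifying model weights in chunks alongside execution, here is a minimal CPU-side Python sketch using hashlib; the helper names (`digest_shards`, `verify_shards`) are hypothetical, and the paper's actual contribution is performing this work on the GPU itself.

```python
# Hypothetical sketch of chunked integrity verification over model weights,
# shown CPU-side with hashlib only; the paper co-locates this work on the GPU.
# Helper names are illustrative, not the framework's API.
import hashlib
import numpy as np

def digest_shards(weights, chunk_bytes=1 << 22):
    """Hash a flat weight tensor in fixed-size chunks."""
    raw = np.ascontiguousarray(weights).view(np.uint8)
    return [hashlib.sha256(raw[i:i + chunk_bytes].tobytes()).hexdigest()
            for i in range(0, raw.size, chunk_bytes)]

def verify_shards(weights, manifest):
    """Recompute digests and compare against a trusted manifest."""
    return digest_shards(weights) == manifest

weights = np.random.default_rng(0).standard_normal(1 << 20, dtype=np.float32)
manifest = digest_shards(weights)        # produced when the model is signed
assert verify_shards(weights, manifest)  # checked at load / serving time
```
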
Nov 2

Serve Programs, Not Prompts

Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to […]
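
To make the "programs instead of prompts" idea concrete, here is a toy Python sketch of an inference program with its own control flow; the `llm` callable and the program shape are illustrative stand-ins, not the paper's actual LIP interface.

```python
# Toy "inference program": control flow over multiple LLM calls lives in the
# program, not in one flat prompt, so the server can schedule, batch, and
# cache the calls as it sees fit. The `llm` callable is a stub.
def summarize_then_translate(llm, document):
    summary = llm(f"Summarize in three sentences:\n{document}")
    return llm(f"Translate to French:\n{summary}")

fake_llm = lambda prompt: f"<generated for: {prompt[:32]}...>"
print(summarize_then_translate(fake_llm, "Long report text ..."))
```
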
Nov 2

Enhancing Transformer Performance and Portability through Auto-tuning Frameworks

Transformer-based models such as BERT and GPT2 have become the foundation of many modern applications, yet their execution requires substantial computational and memory resources. To address these challenges, recent advances in compiler technology and hardware accelerators have introduced new opportunities for performance portability. In this work, we evaluate JAX and TVM as high-level frameworks that […]
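
For a concrete taste of the high-level style such frameworks compile, here is a minimal scaled dot-product attention under `jax.jit`; this is a generic illustration, not the benchmark code from the paper.

```python
# Generic illustration: scaled dot-product attention compiled with jax.jit,
# the kind of high-level kernel such frameworks lower to hardware.
import jax
import jax.numpy as jnp

@jax.jit
def attention(q, k, v):
    scores = q @ k.T / jnp.sqrt(q.shape[-1])      # (seq, seq) similarities
    return jax.nn.softmax(scores, axis=-1) @ v    # weighted sum of values

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(k1, (128, 64))
k = jax.random.normal(k2, (128, 64))
v = jax.random.normal(k3, (128, 64))
print(attention(q, k, v).shape)                   # first call JIT-compiles
```
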
Nov 2

A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations

Deep learning (DL) has already played a significant role in numerous fields, making it crucial to ensure the stability of both training and inference in DL systems. The computation of DL models can be viewed as the execution of a series of DL operators, which are essential components that perform the core numerical computations. Therefore, […]
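
A small numpy demonstration of why operator-level precision choices matter: accumulating many small values in float16 loses contributions that a float32 accumulator keeps. This is a generic example, not the paper's tuning methodology.

```python
# Generic demo, not the paper's method: a pure float16 accumulator silently
# drops contributions once the running sum grows, while a float32 accumulator
# over the same float16 data stays accurate.
import numpy as np

x = np.full(20_000, 1e-3, dtype=np.float16)   # exact sum would be ~20.0
acc = np.float16(0)
for v in x:
    acc = np.float16(acc + v)                 # fp16 accumulation
print(float(acc))                             # stalls far below 20: small
                                              # increments round away
print(float(x.astype(np.float32).sum()))      # ~20.0 with fp32 accumulator
```
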
Nov 2

INT vs. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Modern AI hardware, such as Nvidia’s Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper […]
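
To illustrate the granularity axis the study compares, here is a numpy sketch of symmetric per-group INT4 quantization; the FP formats and hardware kernels from the paper are not reproduced here.

```python
# Symmetric per-group INT4 quantize/dequantize at several granularities; a
# generic sketch showing why finer groups isolate activation outliers.
import numpy as np

def int4_dequant(x, group):
    g = x.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # map group max to 7
    q = np.clip(np.round(g / scale), -8, 7)             # int4 codes
    return (q * scale).reshape(x.shape)                 # dequantized values

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
x[::512] *= 20                                   # inject activation outliers
for group in (4096, 128, 32):                    # coarse -> fine granularity
    print(group, np.abs(x - int4_dequant(x, group)).mean())
```
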
Oct 26

Collective Communication for 100k+ GPUs

The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. Traditional communication methods face significant throughput and latency limitations at this scale, hindering both the development and deployment of state-of-the-art models. This paper presents the NCCLX collective communication framework, developed […]
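
For intuition about what such frameworks optimize, here is a pure-Python simulation of ring all-reduce, the classic collective that libraries like NCCL build on; NCCLX's topology-aware schedules at 100k+ GPUs go far beyond this sketch.

```python
# Pure-Python simulation of ring all-reduce: a reduce-scatter phase followed
# by an all-gather phase, each taking n-1 steps around the ring.
import numpy as np

def ring_allreduce(local):
    """local[r] is rank r's vector; returns each rank's all-reduced copy."""
    n = len(local)
    parts = [list(np.array_split(v.astype(np.float64), n)) for v in local]
    for s in range(n - 1):                        # phase 1: reduce-scatter
        for r in range(n):
            seg = (r - s) % n                     # segment r forwards now
            parts[(r + 1) % n][seg] = parts[(r + 1) % n][seg] + parts[r][seg]
    for s in range(n - 1):                        # phase 2: all-gather
        for r in range(n):
            seg = (r + 1 - s) % n                 # completed segment moves on
            parts[(r + 1) % n][seg] = parts[r][seg].copy()
    return [np.concatenate(p) for p in parts]

ranks = [np.arange(8, dtype=np.float64) + r for r in range(4)]
assert all(np.allclose(o, sum(ranks)) for o in ring_allreduce(ranks))
```
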
Oct 26

A Compute Graph Simulation and Implementation Framework Targeting AMD Versal AI Engines

We present a framework for developing compute graph-based applications targeting the AI Engine (AIE) array of AMD Versal SoCs. This framework enables users to embed AIE-based dataflow graph prototypes directly within existing C++ applications and automatically transform them into deployable AIE graph projects. It thereby eliminates the need to manually separate host and accelerator codebases, […]
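
To illustrate the compute-graph programming model in the abstract (kernels as nodes, dataflow as edges), here is a toy Python sketch; nothing in it is the framework's real C++/AIE API.

```python
# Toy dataflow-graph abstraction: nodes are kernels, edges carry values from
# producers to consumers. Purely illustrative of the programming model.
class Graph:
    def __init__(self):
        self.nodes = []            # kept in insertion (topological) order

    def add(self, name, fn, *inputs):
        self.nodes.append((name, fn, inputs))
        return name

    def run(self, feeds=None):
        values = dict(feeds or {})
        for name, fn, inputs in self.nodes:
            values[name] = fn(*[values[i] for i in inputs])
        return values

g = Graph()
src = g.add("src", lambda: [1, 2, 3])
sq = g.add("square", lambda xs: [x * x for x in xs], src)
out = g.add("sum", lambda xs: sum(xs), sq)
print(g.run()[out])                # 14
```
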
Oct 26

STARK: Strategic Team of Agents for Refining Kernels

The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific characteristics. While recent advances in large language models (LLMs) provide new opportunities for automated code generation, existing approaches largely treat LLMs as […]
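
Below is a schematic generate-measure-refine loop of the kind agentic kernel tuners build on; the `llm` and `benchmark` callables are stand-ins, and the paper's team-of-agents design is richer than this single loop.

```python
# Schematic refinement loop: propose a candidate kernel, measure it, and keep
# it only if it is verifiably faster. Stand-in callables make it runnable.
def refine_kernel(llm, benchmark, source, rounds=3):
    best_src, best_time = source, benchmark(source)
    for _ in range(rounds):
        candidate = llm(f"Optimize this GPU kernel, preserve semantics:\n{best_src}")
        t = benchmark(candidate)
        if t < best_time:                 # keep only measured improvements
            best_src, best_time = candidate, t
    return best_src, best_time

fake_llm = lambda prompt: prompt.rsplit("\n", 1)[-1] + "  /* tweaked */"
fake_benchmark = lambda src: len(src) * 1e-3      # stand-in "latency"
print(refine_kernel(fake_llm, fake_benchmark, "__global__ void k() {}"))
```
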
Oct 26

Architecting Tensor Core-Based Reductions for Irregular Molecular Docking Kernels

Tensor Cores (TCs) are specialized hardware units designed for efficient matrix multiplication and are widely utilized in deep learning workloads. However, their adoption in more irregular high-performance computing (HPC) applications remains limited. This paper presents a methodology for effectively integrating TCs into a representative HPC application: molecular docking with AutoDockGPU. The irregular computational patterns and […]
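
The core mapping trick can be shown in numpy form: a row-wise sum is a matrix product with a ones vector, exactly the shape MMA units execute natively. The paper applies this inside irregular CUDA kernels; this sketch only shows the mathematics.

```python
# Reduction expressed as matrix multiplication: summing each row of `scores`
# equals multiplying by a ones vector, which Tensor Cores accelerate.
import numpy as np

scores = np.random.default_rng(0).standard_normal((16, 16)).astype(np.float16)
ones = np.ones((16, 1), dtype=np.float16)

tc_style = scores @ ones                          # one MMA-shaped reduction
reference = scores.astype(np.float32).sum(axis=1, keepdims=True)
assert np.allclose(tc_style.astype(np.float32), reference, atol=0.1)
```
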
Oct 26

Tutoring LLM into a Better CUDA Optimizer

Recent leaps in large language models (LLMs) have caused a revolution in programming tools (such as GitHub Copilot) that help with code generation, debugging, and even performance optimization. In this paper, we focus on the capabilities of the most recent reasoning models to generate optimized CUDA code for predefined, well-known tasks. Our objective is to determine […]
Oct 19

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Specializing kernels by including runtime information during just-in-time (JIT) compilation can improve performance at the expense of potentially generating more kernels. In this work, we contribute the runtime adaptivity framework that we have implemented in AdaptiveCpp. This framework can automatically generate specialized kernels at JIT time, taking into account various information about the kernel invocation, […]
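
The specialization idea can be sketched in Python, with `exec()` standing in for the JIT: bake a value that is only known at run time (here, a trip count) into the generated kernel so the compiler can unroll and fold it. AdaptiveCpp does this for SYCL kernels, not Python functions.

```python
# Runtime-value specialization: generate one kernel per observed value of n,
# with n embedded as a compile-time constant in the generated source.
def specialize_saxpy(n):
    src = f"""
def saxpy(a, x, y):
    for i in range({n}):      # {n} is a constant in this specialization
        y[i] += a * x[i]
"""
    ns = {{}} if False else {}
    exec(src, ns)             # "JIT-compile" the specialized kernel
    return ns["saxpy"]

saxpy_1024 = specialize_saxpy(1024)
x, y = [1.0] * 1024, [2.0] * 1024
saxpy_1024(3.0, x, y)
print(y[0])                   # 5.0
```
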
Oct 19

Compiler and Runtime Systems for Generative AI Models

Generative AI (GenAI) workloads have rapidly become the predominant data center GPU workload. However, designing efficient GPU kernels for GenAI presents significant challenges due to two central factors: (1) GenAI workloads are intrinsically dynamic—featuring variable sequence lengths and irregular sparsity patterns—and (2) they evolve at a rapid pace, with shifting model architectures and changing deployment […]
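
One concrete face of the dynamism problem is variable sequence length; a common mitigation (not necessarily the one described here) is bucketing requests into a few padded shapes so only a handful of kernels must be compiled.

```python
# Shape bucketing: pad each request up to the smallest precompiled kernel
# shape that fits, keeping the true length around for masking. Generic sketch.
import numpy as np

BUCKETS = (128, 256, 512, 1024)                  # precompiled kernel shapes

def pad_to_bucket(tokens):
    n = len(tokens)
    bucket = next(b for b in BUCKETS if b >= n)  # smallest fitting shape
    padded = np.zeros(bucket, dtype=np.int32)
    padded[:n] = tokens
    return padded, n

padded, true_len = pad_to_bucket(np.arange(300))
print(padded.shape, true_len)                    # (512,) 300
```
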

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
