Posts
Aug, 31
Scaling GPU-Accelerated Databases beyond GPU Memory Size
There has been considerable interest in leveraging GPUs’ computational power and high memory bandwidth for analytical database workloads. However, their limited memory capacity remains a fundamental constraint for databases whose sizes far exceed the GPU memory size. This challenge is exacerbated by the slow PCIe data transfer speed, which creates a bottleneck in overall system […]
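The PCIe bottleneck mentioned above is commonly softened by overlapping transfers with computation. Below is a minimal sketch of that general pattern (not the technique of this paper) using pinned host memory and CUDA streams; the chunk count, sizes, and the process_chunk kernel are illustrative assumptions.

```cpp
// Sketch: hide PCIe transfer latency by streaming chunks and overlapping
// host<->device copies with kernel execution (CUDA streams + pinned memory).
// Chunk count, sizes, and the process_chunk kernel are illustrative assumptions.
#include <cuda_runtime.h>

__global__ void process_chunk(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;          // placeholder per-element work
}

int main() {
    const int kChunks = 4;
    const int kElems  = 1 << 20;               // elements per chunk
    const size_t kBytes = kElems * sizeof(float);

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost((void**)&h_in,  kChunks * kBytes);   // pinned memory enables
    cudaMallocHost((void**)&h_out, kChunks * kBytes);   // truly asynchronous copies
    cudaMalloc((void**)&d_in,  kChunks * kBytes);
    cudaMalloc((void**)&d_out, kChunks * kBytes);

    cudaStream_t streams[kChunks];
    for (int c = 0; c < kChunks; ++c) cudaStreamCreate(&streams[c]);

    for (int c = 0; c < kChunks; ++c) {
        float* hin_c  = h_in  + (size_t)c * kElems;
        float* hout_c = h_out + (size_t)c * kElems;
        float* din_c  = d_in  + (size_t)c * kElems;
        float* dout_c = d_out + (size_t)c * kElems;
        // Copy-in, compute, and copy-out of chunk c overlap with other chunks.
        cudaMemcpyAsync(din_c, hin_c, kBytes, cudaMemcpyHostToDevice, streams[c]);
        process_chunk<<<(kElems + 255) / 256, 256, 0, streams[c]>>>(din_c, dout_c, kElems);
        cudaMemcpyAsync(hout_c, dout_c, kBytes, cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < kChunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```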
Aug, 31
BePilot: An AI Programming Assistant for Compiler Backend Development
Compiler backends are tasked with generating executable machine code for various processors. As the diversity of processors continues to grow, it is imperative for programmers to tailor specific compiler backends to accommodate each one. However, compiler backend development remains a labor-intensive and time-consuming process, with limited automation tools available. Although large language models (LLMs) have […]
Aug, 24
Profiling Concurrent Vision Inference Workloads on NVIDIA Jetson – Extended
The proliferation of IoT devices and advancements in network technologies have intensified the demand for real-time data processing at the network edge. To address these demands, low-power AI accelerators, particularly GPUs, are increasingly deployed for inference tasks, enabling efficient computation while mitigating the latency and bandwidth limitations of cloud-based systems. Despite their growing deployment, GPUs remain underutilised […]
Aug, 24
Towards Efficient and Practical GPU Multitasking in the Era of LLM
GPU singletasking is becoming increasingly inefficient and unsustainable as hardware capabilities grow and workloads diversify. We are now at an inflection point where GPUs must embrace multitasking, much like CPUs did decades ago, to meet the demands of modern AI workloads. In this work, we highlight the key requirements for GPU multitasking, examine prior efforts, […]
Aug, 24
Bandicoot: A Templated C++ Library for GPU Linear Algebra
We introduce the Bandicoot C++ library for linear algebra and scientific computing on GPUs, overviewing its user interface and performance characteristics, as well as the technical details of its internal design. Bandicoot is the GPU-enabled counterpart to the well-known Armadillo C++ linear algebra library, aiming to allow users to take advantage of GPU-accelerated computation for […]
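As a quick illustration of what an Armadillo-style interface looks like on a GPU, here is a minimal sketch assuming Bandicoot mirrors Armadillo's API as the paper describes; the matrix sizes and the specific functions used are assumptions.

```cpp
// Minimal sketch of GPU linear algebra with Bandicoot, assuming it mirrors
// Armadillo's interface as described; sizes and functions are illustrative.
#include <iostream>
#include <bandicoot>

int main() {
    coot::fmat A(2048, 2048, coot::fill::randu);   // matrices live in GPU memory
    coot::fmat B(2048, 2048, coot::fill::randu);

    coot::fmat C = A * B + 0.5f * A.t();           // expression evaluated on the GPU
    float total = coot::accu(C);                   // reduction performed on the GPU

    std::cout << "sum of C = " << total << std::endl;
    return 0;
}
```

The intent, per the excerpt, is that existing Armadillo-based codebases can take advantage of GPU acceleration with minimal changes.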
Aug, 24
Fuzz4CUDA: Fuzzing Your NVIDIA GPU Libraries Through Debug Interface
The programming security of the Compute Unified Device Architecture (CUDA), NVIDIA’s parallel computing platform and programming model for graphics processing units (GPUs), has always been a significant concern. On the host side, fuzzing has been remarkably successful at uncovering various software bugs and vulnerabilities, with hundreds of flaws discovered annually through different fuzzing tools. However, existing fuzzing tools typically […]
Aug, 24
Inter-APU Communication on AMD MI300A Systems via Infinity Fabric: a Deep Dive
The ever-increasing compute performance of GPU accelerators drives up the need for efficient data movement within HPC applications to sustain performance. Proposed as a solution to alleviate CPU-GPU data movement, the AMD MI300A Accelerated Processing Unit (APU) combines CPU, GPU, and high-bandwidth memory (HBM) within a single physical package. Leadership supercomputers, such as El Capitan, group […]
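To illustrate why a single-package APU changes the data-movement picture, here is a minimal HIP sketch, assuming MI300A-style unified memory where a plain host allocation is directly visible to the GPU (enabling this may require unified-memory support such as HSA_XNACK=1; the kernel and sizes are illustrative).

```cpp
// Sketch: on an APU that shares one HBM pool between CPU and GPU, a kernel can
// operate on ordinary host-allocated memory without explicit hipMemcpy calls.
// Assumes unified memory is enabled (e.g. HSA_XNACK=1); sizes are illustrative.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void scale(double* x, int n, double a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    double* x = static_cast<double*>(malloc(n * sizeof(double)));
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    // No host-to-device copy: the GPU accesses the same physical memory.
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0);
    hipDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);   // CPU reads the GPU's result in place
    free(x);
    return 0;
}
```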
Aug, 17
Luthier: Bridging Auto-Tuning and Vendor Libraries for Efficient Deep Learning Inference
Recent deep learning compilers commonly adopt auto-tuning approaches that search for the optimal kernel configuration in tensor programming from scratch, requiring tens of hours per operation and neglecting crucial optimization factors for parallel computing on asymmetric multicore processors. Meanwhile, hand-optimized inference libraries from hardware vendors provide high performance but lack the flexibility and automation needed […]
Aug, 17
The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries
Existing GPU libraries often struggle to fully exploit the parallel resources and on-chip memory (SRAM) of GPUs when chaining multiple GPU functions as individual kernels. While Kernel Fusion (KF) techniques like Horizontal Fusion (HF) and Vertical Fusion (VF) can mitigate this, current implementations often require library developers to manually create fused kernels. Hence, library […]
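To make the vertical-fusion idea concrete, here is a hand-written sketch (not the Fused Kernel Library's API): two elementwise stages merged into one kernel so the intermediate value stays in registers instead of round-tripping through global memory. The kernels and operations are illustrative assumptions.

```cpp
// Unfused: two launches; the intermediate array 'tmp' round-trips through global memory.
__global__ void scale(const float* in, float* tmp, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * in[i];
}
__global__ void add_bias(const float* tmp, float* out, int n, float b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + b;
}

// Vertically fused: one launch; the intermediate value lives in a register.
__global__ void scale_add_bias(const float* in, float* out, int n, float a, float b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * in[i];   // never written to global memory
        out[i] = t + b;
    }
}
```

Writing such fused kernels by hand for every combination of operations is the manual effort the library aims to remove.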
Aug, 17
Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling
This paper presents Block, a distributed scheduling framework designed to optimize load balancing and auto-provisioning across instances in large language model serving frameworks by leveraging contextual information from incoming requests. Unlike popular model serving systems that rely on monolithic and heuristic task schedulers, Block operates as a fully distributed, stateless, and predictive scheduling system to […]
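As a rough, purely hypothetical illustration of what predictive, context-aware dispatch can mean (this is not Block's algorithm; the predictor and cost model below are invented for illustration), a router might send each request to the instance with the lowest predicted backlog.

```cpp
// Hypothetical sketch of context-aware predictive dispatch; not Block's algorithm.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Instance {
    std::size_t queued_tokens;   // tokens admitted to this instance but not yet generated
};

// Invented heuristic: guess output length from request context (prompt size, request type).
std::size_t predict_output_tokens(std::size_t prompt_tokens, bool is_chat) {
    return is_chat ? prompt_tokens / 2 + 64 : 32;
}

// Route to the instance whose predicted backlog after admitting the request is smallest.
std::size_t pick_instance(const std::vector<Instance>& instances,
                          std::size_t prompt_tokens, bool is_chat) {
    const std::size_t predicted = predict_output_tokens(prompt_tokens, is_chat);
    std::size_t best = 0;
    std::size_t best_cost = SIZE_MAX;
    for (std::size_t i = 0; i < instances.size(); ++i) {
        const std::size_t cost = instances[i].queued_tokens + prompt_tokens + predicted;
        if (cost < best_cost) { best_cost = cost; best = i; }
    }
    return best;
}
```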
Aug, 17
Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision
This paper presents a portable, GPU-accelerated implementation of a QR-based singular value computation algorithm in Julia. The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing and machine learning, providing optimal low-rank matrix approximations. Its importance has grown further in large-scale machine learning pipelines, including large language models (LLMs), where it […]
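For reference, the decomposition and the optimality property the excerpt alludes to (the standard Eckart–Young result, in standard notation):

```latex
% SVD of an m x n matrix A and the optimal rank-k truncation (Eckart–Young).
A = U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_{\min(m,n)}), \quad
\sigma_1 \ge \sigma_2 \ge \cdots \ge 0,
\\[4pt]
A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^{\top}
    \;=\; \operatorname*{arg\,min}_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F .
```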
Aug, 17
GPUHammer: Rowhammer Attacks on GPU Memories are Practical
Rowhammer is a read disturbance vulnerability in modern DRAM that causes bit-flips, compromising security and reliability. While extensively studied on Intel and AMD CPUs with DDR and LPDDR memories, its impact on GPUs using GDDR memories, critical for emerging machine learning applications, remains unexplored. Rowhammer attacks on GPUs face unique challenges: (1) proprietary mapping of […]

