high performance computing on graphics processing units: hgpu.org

Posts

Jun, 8

MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning

Owing to the huge success of generative artificial intelligence (AI), large language models (LLMs) have emerged as a core subclass, underpinning applications such as question answering, text generation, and code completion. While fine-tuning these models on domain-specific data can yield significant performance gains, it also poses daunting computational challenges, especially for researchers and small organizations […]

Jun, 8

All You Need Is Binary Search! A Practical View on Lightweight Database Indexing on GPUs

Performing binary search on a sorted dense array is a widely used baseline when benchmarking sophisticated index structures: It is simple, fast to build, and indexes the dataset with minimal memory footprint. However, the popular opinion is that it cannot compete with sophisticated indexes in terms of lookup performance, and hence, should not actually be […]

CUDA

Jun, 8

GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency

GPU computing is embracing weak memory concurrency for performance improvement. However, compared to CPUs, modern GPUs provide more fine-grained concurrency features such as scopes, have additional properties like divergence, and thereby follow different weak memory consistency models. These features and properties make concurrent programming on GPUs more complex and error-prone. To this end, we present […]

CUDA

•

OpenCL

May, 25

Exploring SYCL for batched kernels with memory allocations

Batched kernels with memory allocations is a common pattern in HPC, appearing in multi-dimensional FFTs, neural networks processing, or split computation of numerical operators. Its efficient support is especially complex on GPU where memory per work-item is limited and dynamic memory allocations are challenging. This study investigates whether the native abstractions of SYCL can support […]

CUDA

May, 25

Performance of Confidential Computing GPUs

This work examines latency, throughput, and other metrics when performing inference on confidential GPUs. We explore different traffic patterns and scheduling strategies using a single Virtual Machine with one NVIDIA H100 GPU, to perform relaxed batch inferences on multiple Large Language Models (LLMs), operating under the constraint of swapping models in and out of memory, […]

CUDA

May, 25

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA<->HIP) and assembly-level (Nvidia SASS<->AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific […]

CUDA

•

OpenCL

May, 25

FLASH: Fast All-to-All Communication in GPU Clusters

Scheduling All-to-All communications efficiently is fundamental to minimizing job completion times in distributed systems. Incast and straggler flows can slow down All-to-All transfers; and GPU clusters bring additional straggler challenges due to highly heterogeneous link capacities between technologies like NVLink and Ethernet. Existing schedulers all suffer high overheads relative to theoretically optimal transfers. Classical, simple […]

May, 25

Low-cost edge computing using upcycled smartphones

Smartphone users often replace their devices prematurely for newer models, contributing to the growing issue of waste electrical and electronic equipment (WEEE). Repurposing these devices to extend their life cycle by assigning them new roles can help mitigate this problem. This thesis explores the feasibility of creating a cluster using upcycled smartphones deployed with the […]

May, 18

Comparing Parallel Functional Array Languages: Programming and Performance

Parallel functional array languages are an emerging class of programming languages that promise to combine low-effort parallel programming with good performance and performance portability. We systematically compare the designs and implementations of five different functional array languages: Accelerate, APL, DaCe, Futhark, and SaC. We demonstrate the expressiveness of functional array programming by means of four […]

CUDA

•

OpenCL

May, 18

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Graph embeddings provide continuous vector representations of nodes in a graph, which are widely applicable in community detection, recommendations, and various scientific fields. However, existing graph embedding systems either face scalability challenges due to the high cost of RAM and multiple GPUs, or rely on disk storage at the expense of I/O efficiency. In this […]

CUDA

May, 18

Can Large Language Models Predict Parallel Code Performance?

Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware — an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large Language Models (LLMs) can offer an alternative approach for GPU performance prediction without relying on hardware. We frame the problem as a […]

CUDA

May, 18

GPU Performance Portability needs Autotuning

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today’s reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable, state-of-the-art […]

CUDA

* * *

high performance computing on graphics processing units: hgpu.org

Posts

MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning

All You Need Is Binary Search! A Practical View on Lightweight Database Indexing on GPUs

GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency

Exploring SYCL for batched kernels with memory allocations

Performance of Confidential Computing GPUs

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

FLASH: Fast All-to-All Communication in GPU Clusters

Low-cost edge computing using upcycled smartphones

Comparing Parallel Functional Array Languages: Programming and Performance

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

GPU Performance Portability needs Autotuning

Recent source codes

Allo: Accelerator Design Language

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

HPC Benchmark Survey

HDM: Home made Diffusion Models

General Matrix Multiplication (GEMM)

CrossTL: Universal Programming Language & Translator

TBD-GPU

DG-SWEM - The Discontinuous Galerkin Shallow Water Equation Model

torchPDLP: Primal-Dual Linear Programming in PyTorch. In collaboration with AMD and IPAM

Benchmarks for Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs

Most viewed papers (last 30 days)