high performance computing on graphics processing units: hgpu.org

Posts

Jun, 15

GPU Acceleration of SQL Analytics on Compressed Data

GPUs are uniquely suited to accelerate (SQL) analytics workloads thanks to their massive compute parallelism and High Bandwidth Memory (HBM) — when datasets fit in the GPU HBM, performance is unparalleled. Unfortunately, GPU HBMs remain typically small when compared with lower-bandwidth CPU main memory. Besides brute-force scaling across many GPUs, current solutions to accelerate queries […]

CUDA

Jun, 15

Enabling Profile Guided Optimizations (PGO) for Graphics

This master's thesis presents an implementation of profile-guided optimizations (PGO) for mobile phone GPUs. PGO is an optimization technique that uses runtime profiling data, such as block frequency and function call frequency, to guide compiler optimizations. The implementation adapts the existing PGO infrastructure in LLVM to accommodate the architectural differences between CPUs […]

Jun, 15

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

Machine learning potentials (MLPs) have advanced rapidly and show great promise to transform molecular dynamics (MD) simulations. However, most existing software tools are tied to specific MLP architectures, lack integration with standard MD packages, or are not parallelizable across GPUs. To address these challenges, we present chemtrain-deploy, a framework that enables model-agnostic deployment of MLPs […]

CUDA

Jun, 8

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

As multicore vector processors improve in computational and memory performance, running SIMT (Single Instruction Multiple Threads) programs on CPUs has become increasingly appealing, potentially eliminating the need for dedicated GPU hardware. SYCL is a royalty-free cross-platform C++ programming model for heterogeneous computing that implements the SIMT model and provides a path to run GPU programs […]

OpenCL

Jun, 8

Acceleration as a Service (XaaS) Source Containers

In this thesis, we address the challenge of performance portability in heterogeneous computing environments. Performance portability refers to the ability of an application to maintain high performance on multiple platforms without requiring extensive manual tuning for each system. Traditional containers fall short in this regard as they prioritize portability at the expense of architecture-specific optimizations. […]

Jun, 8

MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning

Owing to the huge success of generative artificial intelligence (AI), large language models (LLMs) have emerged as a core subclass, underpinning applications such as question answering, text generation, and code completion. While fine-tuning these models on domain-specific data can yield significant performance gains, it also poses daunting computational challenges, especially for researchers and small organizations […]

Jun, 8

All You Need Is Binary Search! A Practical View on Lightweight Database Indexing on GPUs

Performing binary search on a sorted dense array is a widely used baseline when benchmarking sophisticated index structures: It is simple, fast to build, and indexes the dataset with minimal memory footprint. However, the popular opinion is that it cannot compete with sophisticated indexes in terms of lookup performance, and hence, should not actually be […]

CUDA
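
The baseline named in the binary search entry above, a lookup over a sorted dense array, can be sketched in a few lines. This is a minimal single-threaded illustration in Python, not the paper's GPU implementation; the function name `binary_search` and the integer key type are assumptions made for the example.

```python
from typing import Optional, Sequence


def binary_search(keys: Sequence[int], target: int) -> Optional[int]:
    """Return the index of the first occurrence of target in the sorted
    sequence keys, or None if target is absent.

    Hypothetical helper for illustration only.
    """
    lo, hi = 0, len(keys)  # search the half-open range [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if keys[mid] < target:
            lo = mid + 1  # target can only be to the right of mid
        else:
            hi = mid  # target is at mid or to its left
    # lo is now the first position whose key is >= target
    if lo < len(keys) and keys[lo] == target:
        return lo
    return None
```

The "index" here is just the sorted array itself, which is why the abstract describes it as fast to build and minimal in memory footprint: no auxiliary structure is stored, and each lookup takes a logarithmic number of comparisons.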

Jun, 8

GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency

GPU computing is embracing weak memory concurrency for performance improvement. However, compared to CPUs, modern GPUs provide more fine-grained concurrency features such as scopes, have additional properties like divergence, and thereby follow different weak memory consistency models. These features and properties make concurrent programming on GPUs more complex and error-prone. To this end, we present […]

CUDA • OpenCL

May, 25

Exploring SYCL for batched kernels with memory allocations

Batched kernels with memory allocations are a common pattern in HPC, appearing in multi-dimensional FFTs, neural network processing, or split computation of numerical operators. Supporting them efficiently is especially complex on GPUs, where memory per work-item is limited and dynamic memory allocations are challenging. This study investigates whether the native abstractions of SYCL can support […]

CUDA

May, 25

Performance of Confidential Computing GPUs

This work examines latency, throughput, and other metrics when performing inference on confidential GPUs. We explore different traffic patterns and scheduling strategies using a single Virtual Machine with one NVIDIA H100 GPU, to perform relaxed batch inferences on multiple Large Language Models (LLMs), operating under the constraint of swapping models in and out of memory, […]

CUDA

May, 25

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA<->HIP) and assembly-level (Nvidia SASS<->AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific […]

CUDA • OpenCL

May, 25

FLASH: Fast All-to-All Communication in GPU Clusters

Scheduling All-to-All communications efficiently is fundamental to minimizing job completion times in distributed systems. Incast and straggler flows can slow down All-to-All transfers, and GPU clusters bring additional straggler challenges due to highly heterogeneous link capacities between technologies like NVLink and Ethernet. Existing schedulers all suffer high overheads relative to theoretically optimal transfers. Classical, simple […]

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)