high performance computing on graphics processing units: hgpu.org

Posts

Feb, 4

Gallatin: A General-Purpose GPU Memory Manager

Dynamic memory management is critical for efficiently porting modern data processing pipelines to GPUs. However, building a general-purpose dynamic memory manager on GPUs is challenging due to the massive parallelism and weak memory coherence. Existing state-of-the-art GPU memory managers, Ouroboros and Reg-Eff, employ traditional data structures such as arrays and linked lists to manage memory […]

CUDA

Feb, 4

Deductive verification for SYCL

A heterogeneous computing system is a system composed of different types of computing units. SYCL is a software development framework with which programs can be developed for such systems. It uses the concept of kernels, where a kernel executes code inside it in parallel, and different kernels can be executed concurrently on multiple computing units. […]

CUDA

•

OpenCL

Feb, 4

LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory

This paper describes LeftoverLocals: a vulnerability that allows data recovery from GPU memory created by another process on Apple, Qualcomm, and AMD GPUs. LeftoverLocals impacts the security posture of GPU applications, with particular significance to LLMs and ML models that run on impacted GPUs. By recovering local memory, an optimized GPU memory region, we built […]

OpenCL

Feb, 4

Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core

The cryosphere plays a significant role in Earth’s climate system. Therefore, an accurate simulation of sea ice is of great importance to improve climate projections. To enable higher resolution simulations, graphics processing units (GPUs) have become increasingly attractive as they offer higher floating point peak performance and better energy efficiency compared to CPUs. However, making […]

CUDA

Feb, 4

High-order thread-safe lattice Boltzmann model for HPC turbulent flow simulations

We present a highly-optimized thread-safe lattice Boltzmann model in which the non-equilibrium part of the distribution function is locally reconstructed via recursivity of Hermite polynomials. Such a procedure allows the explicit incorporation of non-equilibrium moments of the distribution up to the order supported by the lattice. Thus, the proposed approach increases accuracy and stability at […]

CUDA

Jan, 28

Assessing the Impact of Compiler Optimizations on GPUs Reliability

Graphics Processing Units (GPUs) compilers have evolved in order to support general-purpose programming languages for multiple architectures. NVIDIA CUDA Compiler (NVCC) has many compilation levels before generating the machine code and applies complex optimizations to improve performance. These optimizations modify how the software is mapped in the underlying hardware; thus, as we show in this […]

CUDA

Jan, 28

Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame

The world’s largest particle accelerator, located at CERN, produces petabytes of data that need to be analysed efficiently, to study the fundamental structures of our universe. ROOT is an open-source C++ data analysis framework, developed for this purpose. Its high-level data analysis interface, RDataFrame, currently only supports CPU parallelism. Given the increasing heterogeneity in computing […]

CUDA

•

OpenCL

Jan, 28

Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels

Next generation High-Energy Physics (HEP) experiments are presented with significant computational challenges, both in terms of data volume and processing power. Using compute accelerators, such as GPUs, is one of the promising ways to provide the necessary computational power to meet the challenge. The current programming models for compute accelerators often involve using architecture-specific programming […]

Jan, 28

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU

Approximate Nearest Neighbour Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are practically more efficient than the other methods proposed in the literature, on large datasets. The growing volume and dimensionality of data necessitates designing […]

CUDA

Jan, 28

A Heterogeneous Inference Framework for a Deep Neural Network

Artificial intelligence (AI) is one of the most promising technologies based on machine learning algorithms. In this paper, we propose a workflow for the implementation of deep neural networks. This workflow attempts to combine the flexibility of high-level compilers (HLS)-based networks with the architectural control features of hardware description languages (HDL)-based flows. The architecture consists […]

OpenCL

Jan, 21

A Survey on Hardware Accelerators for Large Language Models

Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. As the demand for more sophisticated LLMs continues to grow, there is a pressing need to address the computational challenges associated with their scale and complexity. This paper presents […]

Jan, 21

swCUDA: Auto parallel code translation framework from CUDA to ATHREAD for new generation sunway supercomputer

Since specific hardware characteristics and low-level programming model are adapted to both NVIDIA GPU and new generation Sunway architecture, automatically translating mature CUDA kernels to Sunway ATHREAD kernels are realistic but challenging work. To address this issue, swCUDA, an auto parallel code translation framework is proposed. To that end, we create scale afne translation to […]

CUDA

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Gallatin: A General-Purpose GPU Memory Manager

Deductive verification for SYCL

LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory

Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core

High-order thread-safe lattice Boltzmann model for HPC turbulent flow simulations

Assessing the Impact of Compiler Optimizations on GPUs Reliability

Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame

Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU

A Heterogeneous Inference Framework for a Deep Neural Network

A Survey on Hardware Accelerators for Large Language Models

swCUDA: Auto parallel code translation framework from CUDA to ATHREAD for new generation sunway supercomputer

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

Most viewed papers (last 30 days)