high performance computing on graphics processing units: hgpu.org

Posts

Feb, 12

Training DNN Models over Heterogeneous Clusters with Optimal Performance

Adjusting batch sizes and adaptively tuning other hyperparameters can significantly speed up deep neural network (DNN) training. Despite the ubiquity of heterogeneous clusters, existing adaptive DNN training techniques solely consider homogeneous environments. Optimizing distributed DNN training over heterogeneous clusters is technically challenging, and directly adapting existing techniques results in low utilization and poor performance. To […]

CUDA

Feb, 12

Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUs

Virtual screening is an early stage in the drug discovery process that selects the most promising candidates. In the urgent computing scenario, finding a solution in the shortest time frame is critical. Any improvement in the performance of a virtual screening application translates into an increase in the number of candidates evaluated, thereby raising the […]

CUDA

Feb, 4

Gallatin: A General-Purpose GPU Memory Manager

Dynamic memory management is critical for efficiently porting modern data processing pipelines to GPUs. However, building a general-purpose dynamic memory manager on GPUs is challenging due to the massive parallelism and weak memory coherence. Existing state-of-the-art GPU memory managers, Ouroboros and Reg-Eff, employ traditional data structures such as arrays and linked lists to manage memory […]

CUDA

Feb, 4

Deductive verification for SYCL

A heterogeneous computing system is a system composed of different types of computing units. SYCL is a software development framework with which programs can be developed for such systems. It uses the concept of kernels, where a kernel executes code inside it in parallel, and different kernels can be executed concurrently on multiple computing units. […]

CUDA • OpenCL

Feb, 4

LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory

This paper describes LeftoverLocals: a vulnerability that allows data recovery from GPU memory created by another process on Apple, Qualcomm, and AMD GPUs. LeftoverLocals impacts the security posture of GPU applications, with particular significance to LLMs and ML models that run on impacted GPUs. By recovering local memory, an optimized GPU memory region, we built […]

OpenCL

Feb, 4

Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core

The cryosphere plays a significant role in Earth’s climate system. Therefore, an accurate simulation of sea ice is of great importance to improve climate projections. To enable higher resolution simulations, graphics processing units (GPUs) have become increasingly attractive as they offer higher floating point peak performance and better energy efficiency compared to CPUs. However, making […]

CUDA

Feb, 4

High-order thread-safe lattice Boltzmann model for HPC turbulent flow simulations

We present a highly-optimized thread-safe lattice Boltzmann model in which the non-equilibrium part of the distribution function is locally reconstructed via recursivity of Hermite polynomials. Such a procedure allows the explicit incorporation of non-equilibrium moments of the distribution up to the order supported by the lattice. Thus, the proposed approach increases accuracy and stability at […]

CUDA

Jan, 28

Assessing the Impact of Compiler Optimizations on GPUs Reliability

Graphics Processing Units (GPUs) compilers have evolved in order to support general-purpose programming languages for multiple architectures. NVIDIA CUDA Compiler (NVCC) has many compilation levels before generating the machine code and applies complex optimizations to improve performance. These optimizations modify how the software is mapped in the underlying hardware; thus, as we show in this […]

CUDA

Jan, 28

Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame

The world’s largest particle accelerator, located at CERN, produces petabytes of data that need to be analysed efficiently, to study the fundamental structures of our universe. ROOT is an open-source C++ data analysis framework, developed for this purpose. Its high-level data analysis interface, RDataFrame, currently only supports CPU parallelism. Given the increasing heterogeneity in computing […]

CUDA • OpenCL

Jan, 28

Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels

Next generation High-Energy Physics (HEP) experiments are presented with significant computational challenges, both in terms of data volume and processing power. Using compute accelerators, such as GPUs, is one of the promising ways to provide the necessary computational power to meet the challenge. The current programming models for compute accelerators often involve using architecture-specific programming […]

Jan, 28

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU

Approximate Nearest Neighbour Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are practically more efficient than the other methods proposed in the literature, on large datasets. The growing volume and dimensionality of data necessitates designing […]

CUDA

Jan, 28

A Heterogeneous Inference Framework for a Deep Neural Network

Artificial intelligence (AI) is one of the most promising technologies based on machine learning algorithms. In this paper, we propose a workflow for the implementation of deep neural networks. This workflow attempts to combine the flexibility of high-level compilers (HLS)-based networks with the architectural control features of hardware description languages (HDL)-based flows. The architecture consists […]

OpenCL

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)