Posts
Aug 7
A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters
Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such as graphics processing units (GPUs) in a distributed manner. In the academic field, researchers gain access to such resources through High […]
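For a concrete picture of such a job, below is a minimal sketch of a distributed PyTorch training script, assuming a `torchrun` launch from inside a container (e.g., `apptainer exec --nv image.sif torchrun --nnodes=2 --nproc_per_node=4 train.py`); the file and image names are illustrative, not the paper's actual workflow.

```python
# train.py -- minimal distributed data-parallel sketch (illustrative only).
# Assumes a container image that ships PyTorch, CUDA, and NCCL, and a launch
# via torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])    # gradients sync via all-reduce

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(32, 128, device=local_rank)
    y = torch.randint(0, 10, (32,), device=local_rank)
    for _ in range(10):
        opt.zero_grad()
        torch.nn.functional.cross_entropy(model(x), y).backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```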
Aug 7
Real-Time High-Performance Computing for Embedded Control Systems
Critical real-time systems include a wide spectrum of computer systems whose correct behavior is dictated not only by correct functionality but also by their timely execution with respect to predefined deadlines. The increasing demand for higher performance in these systems has led the industry to recently include embedded Graphics Processing Units (GPUs), mainly for machine […]
Aug 7
COX: Exposing CUDA Warp-Level Functions to CPUs
As CUDA becomes the de facto programming language for data-parallel applications such as high-performance computing and machine learning, running CUDA on other platforms becomes a compelling option. Although several efforts have attempted to support CUDA on devices other than NVIDIA GPUs, due to extra steps in the translation, the support is always a […]
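To make "warp-level functions" concrete, here is a pure-Python model of CUDA's `__shfl_down_sync` and the classic warp-sum reduction built on it; these are the SIMT semantics that any CPU port has to reproduce with scalar loops or SIMD. This sketch illustrates the semantics only and says nothing about how COX itself implements the mapping.

```python
# Pure-Python model of one 32-lane warp executing __shfl_down_sync.
WARP_SIZE = 32

def shfl_down_sync(lanes, delta):
    # Each lane i reads the value held by lane i + delta; lanes whose source
    # falls outside the warp keep their own value, as in CUDA.
    return [lanes[i + delta] if i + delta < WARP_SIZE else lanes[i]
            for i in range(WARP_SIZE)]

def warp_reduce_sum(lanes):
    # Butterfly reduction: after log2(32) = 5 steps, lane 0 holds the sum.
    delta = WARP_SIZE // 2
    while delta >= 1:
        shifted = shfl_down_sync(lanes, delta)
        lanes = [a + b for a, b in zip(lanes, shifted)]
        delta //= 2
    return lanes[0]

print(warp_reduce_sum(list(range(32))))  # 496 == sum(range(32))
```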
Aug 7
Design and Implementation of ShenWei Universal C/C++
The ShenWei many-core series processors powering multiple cutting-edge supercomputers are equipped with their unique on-chip heterogeneous architecture. They have long required programmers to write separate code for the control part on the Management Processing Element (MPE) and the accelerated part on the Compute Processing Element (CPE), which is similar to open standards like OpenCL. Such a programming model […]
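The comparison to OpenCL is a useful mental model: host code and device kernels live in separate worlds and communicate through explicit buffers. Below is a small PyOpenCL example of that split, purely for illustration (ShenWei's own toolchain differs).

```python
# Host program drives a separately compiled device kernel -- the two-part
# programming model the MPE/CPE split resembles. Illustrative OpenCL only.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

kernel_src = """
__kernel void scale(__global float *a, const float factor) {
    int i = get_global_id(0);
    a[i] *= factor;
}
"""
prg = cl.Program(ctx, kernel_src).build()      # device code, built separately

a = np.arange(16, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=a)
prg.scale(queue, a.shape, None, buf, np.float32(2.0))  # launch on the device
cl.enqueue_copy(queue, a, buf)                         # copy the result back
print(a)
```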
Jul 24
Demystifying Dependency Bugs in Deep Learning Stack
Recent breakthroughs in deep learning (DL) techniques have stimulated significant growth in developing DL-enabled applications. These DL applications, built upon a heterogeneous and complex DL stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow), are subject to software and hardware dependencies across the DL stack. A persistent challenge in dependency management across the […]
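A small taste of why this stack is brittle: even within one Python process, several layers report versions that must agree with one another and with the host driver. The introspection below is illustrative only, not the paper's tooling.

```python
# Print the versions that must line up across the DL stack: Python runtime,
# framework, the CUDA toolkit the framework was built against, and cuDNN.
import platform

import torch

print("python:", platform.python_version())
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)             # toolkit torch was built with
if torch.cuda.is_available():
    print("cudnn :", torch.backends.cudnn.version())
    print("device:", torch.cuda.get_device_name(0))
```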
Jul 24
CPU-GPU Layer-Switched Low Latency CNN Inference
Convolutional Neural Network (CNN) inference on Heterogeneous Multi-Processor System-on-Chips (HMPSoCs) in edge devices represents cutting-edge embedded machine learning. The embedded CPU and GPU within an HMPSoC can both perform inference using CNNs. However, common practice is to run a CNN on whichever HMPSoC component (CPU or GPU) provides the best performance (lowest latency) for that CNN. […]
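As a toy illustration of the per-layer angle (not the paper's system), one can time individual CNN layers on CPU and GPU and observe that the faster device can differ layer by layer; the sketch below ignores the CPU-GPU transfer costs that a real layer-switching schedule must also weigh.

```python
# Time each layer of a small CNN on CPU and (if present) GPU. Illustrative
# measurement only; device-switch transfer costs are deliberately ignored.
import time

import torch

layers = [torch.nn.Conv2d(3, 32, 3), torch.nn.Conv2d(32, 64, 3),
          torch.nn.Conv2d(64, 128, 3)]

def layer_latency(layer, inp, device, reps=20):
    layer, inp = layer.to(device), inp.to(device)
    with torch.no_grad():
        for _ in range(5):                       # warm-up
            out = layer(inp)
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(reps):
            out = layer(inp)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - t0) / reps, out.cpu()

inp = torch.randn(1, 3, 224, 224)
for i, layer in enumerate(layers):
    cpu_t, out = layer_latency(layer, inp, "cpu")
    if torch.cuda.is_available():
        gpu_t, _ = layer_latency(layer, inp, "cuda")
        print(f"layer {i}: cpu {cpu_t*1e3:.2f} ms vs gpu {gpu_t*1e3:.2f} ms")
    else:
        print(f"layer {i}: cpu {cpu_t*1e3:.2f} ms (no GPU available)")
    inp = out                                    # feed the next layer
```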
Jul 24
FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis
FPGAs are emerging in the High-Performance Computing domain thanks to their promise of better energy efficiency and low control latency compared with other devices such as CPUs or GPUs. Despite these benefits, their full inclusion into HPC systems still faces several challenges. First, the complexity of FPGAs makes them more difficult to program compared to […]
Jul 24
Theseus: A Library for Differentiable Nonlinear Optimization
We present Theseus, an efficient application-agnostic open source library for differentiable nonlinear least squares (DNLS) optimization built on PyTorch, providing a common framework for end-to-end structured learning in robotics and vision. Existing DNLS implementations are application specific and do not always incorporate many ingredients important for efficiency. Theseus is application-agnostic, as we illustrate with several […]
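To make "differentiable nonlinear least squares" concrete, here is a minimal Gauss-Newton loop in plain PyTorch in which every solver step is differentiable, so gradients flow from the fitted parameters back to an outer learnable quantity (here, per-point weights). This is a sketch of the concept only, not Theseus's actual API.

```python
# Differentiable Gauss-Newton: the inner solver is built from autograd-
# friendly ops, so its output can sit inside an outer learning loop.
import torch

x = torch.linspace(0, 1, 20)
y = 2.0 * torch.exp(-1.5 * x)                    # synthetic targets
weights = torch.ones(20, requires_grad=True)     # outer learnable parameter

def residuals(theta):
    a, b = theta
    return weights * (a * torch.exp(b * x) - y)  # fit y ~ a * exp(b * x)

theta = torch.tensor([1.0, -1.0])
for _ in range(10):                              # Gauss-Newton iterations
    J = torch.autograd.functional.jacobian(residuals, theta, create_graph=True)
    r = residuals(theta)
    theta = theta + torch.linalg.solve(J.T @ J, -J.T @ r)

theta.sum().backward()                           # grads reach `weights`
print(theta.detach(), weights.grad.norm())
```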
Jul 24
On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention
Powered by advances in deep learning (DL) techniques, machine learning and artificial intelligence have achieved astonishing successes. However, the rapidly growing demand for DL has also led to communication- and resource-intensive distributed training jobs for large-scale DL training, which are typically deployed over GPU clusters. To sustain the ever-increasing demand for DL training, the so-called "ring-all-reduce" […]
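The bandwidth argument behind ring-all-reduce is textbook arithmetic (not the paper's scheduling model): with p workers and n bytes of gradients, each worker transfers 2(p-1)/p * n bytes per iteration, almost independent of p, but every step synchronizes the whole ring, so a single contended link stalls every worker.

```python
# Per-worker traffic of ring-all-reduce: a reduce-scatter phase plus an
# all-gather phase, each moving (p - 1) chunks of n / p bytes.
def ring_all_reduce_bytes(n_bytes, p):
    return 2 * (p - 1) / p * n_bytes

n = 100e6                                        # 100 MB of gradients (assumed)
for p in (2, 4, 8, 16):
    sec = ring_all_reduce_bytes(n, p) / 10e9     # assumed 10 GB/s link
    print(f"p={p:2d}: {ring_all_reduce_bytes(n, p)/1e6:6.1f} MB/worker, "
          f"{sec*1e3:5.1f} ms (bandwidth term only)")
```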
Jul 17
Reducing Synchronous GPU Memory Transfers: Design and implementation of a Futhark compiler optimisation
We present a series of dataflow dependent program transformations that reduce memory transfers between a GPU and its host, and show how the problem of minimising memory transfers to the host amounts to finding minimum vertex cuts in a series of data dependency graphs. We provide a specialised algorithm to solve these minimisation problems, based […]
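The reduction is easy to play with on a toy dependency graph, e.g. via networkx's vertex-cut routine (purely illustrative; the paper supplies its own specialised algorithm).

```python
# Toy data-dependency graph: nodes are arrays, edges are dependencies, and a
# minimum vertex cut between GPU-resident inputs and host-needed outputs
# marks the cheapest set of arrays to transfer.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("input", "a"), ("input", "b"),   # intermediate arrays on the GPU
    ("a", "c"), ("b", "c"),
    ("c", "output"),                  # result the host must read
])

print(nx.minimum_node_cut(g, "input", "output"))  # {'c'}: copy just 'c'
```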
Jul 17
Heterogeneous Energy-aware Load Balancing for Industry 4.0 and IoT Environments
With the improvement of global infrastructure, Cyber-Physical Systems (CPS) have become an important component of Industry 4.0. In such systems, applications and machines work together to handle interdependent tasks. Machine learning methods in CPS require the monitoring of computational algorithms, including adopting optimizations, fine-tuning cyber systems, improving resource utilization, as well […]
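As a loose sketch of the energy-aware idea (a toy heuristic with assumed numbers, not the paper's method), a scheduler might greedily place each task on the node with the lowest incremental energy cost while penalising load imbalance:

```python
# Greedy energy-aware placement over three hypothetical nodes; the per-node
# energy costs and the imbalance penalty are assumptions for illustration.
nodes = {"edge-cpu": 1.0, "edge-gpu": 0.4, "cloud": 0.7}  # joules per unit work
load = {n: 0.0 for n in nodes}

def assign(work):
    # minimise this task's energy plus a penalty on the node's current load
    best = min(nodes, key=lambda n: work * nodes[n] + 0.5 * load[n])
    load[best] += work
    return best

for w in [5, 3, 8, 2, 7, 4]:
    print(f"work={w} -> {assign(w):8s} load={load}")
```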
Jul 17
Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading
Following the mass adoption of external accelerators for high performance computing, the overall performance of many applications has become increasingly dependent on relatively small accelerated kernels. As static analysis is fundamentally limited by dynamic values and external definitions, standard ahead-of-time compilation is not always sufficient to achieve the best performance. Furthermore, many users looking to […]