Posts

Feb, 5

Revisiting Query Performance in GPU Database Systems

GPUs offer massive compute parallelism and high-bandwidth memory access. GPU database systems seek to exploit those capabilities to accelerate data analytics. Although modern GPUs have more resources (e.g., higher DRAM bandwidth) than ever before, judicious choices for query processing that avoid wasteful resource allocations are still advantageous. Database systems can save GPU runtime costs through […]
Feb, 5

A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code

Various kinds of applications take advantage of GPUs through automation tools that attempt to exploit the available performance of the GPU’s parallel architecture. Directive-based programming models, such as OpenACC, are one such method: they enable parallel computing simply by attaching annotations to loops. Such abstract models, however, often prevent programmers from […]
Jan, 29

Pulsar search acceleration using FPGAs and OpenCL templates

The Square Kilometre Array (SKA) is the world’s largest radio telescope currently under construction, and will employ elaborate signal processing to detect new pulsars, i.e. highly magnetised rotating neutron stars. This paper addresses the acceleration of demanding computations for this pulsar search on Field-Programmable Gate Arrays (FPGAs) using a new high-level design process based on […]
Jan, 29

SaLoBa: Maximizing Data Locality and Workload Balance for Fast Sequence Alignment on GPUs

Sequence alignment forms an important backbone in many sequencing applications. A commonly used strategy for sequence alignment is approximate string matching with two-dimensional dynamic programming. Although some prior work has addressed GPU acceleration of sequence alignment, we identify several shortcomings that limit exploiting the full computational capability of modern […]
Jan, 29

GPU-based Private Information Retrieval for On-Device Machine Learning Inference

On-device machine learning (ML) inference can enable the use of private user data on user devices without remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. To overcome this barrier, we propose the use of […]
Jan, 29

Implementation of a motion estimation algorithm for Intel FPGAs using OpenCL

Motion Estimation is one of the main tasks behind any video encoder. It is a computationally costly task; therefore, it is usually delegated to specific or reconfigurable hardware, such as FPGAs. Over the years, multiple FPGA implementations have been developed, mainly using hardware description languages such as Verilog or VHDL. Since programming using hardware description […]
Jan, 29

Fast Merge Tree Computation via SYCL

A merge tree is a topological descriptor of a real-valued function. Merge trees are used in visualization and topological data analysis, either directly or as a means to another end: computing a 0-dimensional persistence diagram, identifying connected components, performing topological simplification, etc. Scientific computing relies more and more on GPUs to achieve fast, scalable computation. […]
Jan, 22

Efficient OpenCL system integration of non-blocking FPGA accelerators

OpenCL functions as a portability layer for diverse heterogeneous hardware platforms including CPUs, GPUs, FPGAs, and hardware accelerators. However, OpenCL programs that use several of these devices on the same computing platform suffer from poor coordination between the OpenCL implementations of different hardware vendors. This paper proposes a vendor-independent open-source method for integrating custom FPGA accelerators […]
Jan, 22

Analyzing Resource Utilization in an HPC System: A Case Study of NERSC’s Perlmutter

The resource demands of HPC applications vary significantly. However, it is common for HPC systems to assign resources on a per-node basis to prevent interference from co-located workloads. This gap between the coarse-grained resource allocation and the varying resource demands can lead to underutilization of HPC resources. In this study, we comprehensively analyzed the resource […]
Jan, 22

AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication

Recent advances in deep learning are based on growing model sizes and the necessary scaling of compute power. Training such large-scale models requires an intricate combination of data, operator, and pipeline parallelism in complex distributed systems. We show how to use OneFlow’s Split, Broadcast, and Partial Sum (SBP) tensor formulations to enable new distributed training methods […]
Jan, 22

PIGEON: Optimizing CUDA Code Generator for End-to-End Training and Inference of Relational Graph Neural Networks

Relational graph neural networks (RGNNs) are graph neural networks (GNNs) with dedicated structures for modeling the different types of nodes and/or edges in heterogeneous graphs. While RGNNs have been increasingly adopted in many real-world applications due to their versatility and accuracy, they pose performance and system design challenges due to their inherent computation patterns, gap […]
Jan, 22

PySAGES: flexible, advanced sampling methods accelerated with GPUs

Molecular dynamics simulations are a core element of research in physics, chemistry and biology. A key aspect for extending the capability of simulation tools is providing access to advanced sampling methods and techniques that permit calculation of the relevant, underlying free energy landscapes. In this sense, software tools that can be seamlessly adapted to a […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
