Posts

Jan, 29

Implementation of a motion estimation algorithm for Intel FPGAs using OpenCL

Motion Estimation is one of the main tasks behind any video encoder. It is a computationally costly task; therefore, it is usually delegated to specific or reconfigurable hardware, such as FPGAs. Over the years, multiple FPGA implementations have been developed, mainly using hardware description languages such as Verilog or VHDL. Since programming using hardware description […]
Jan, 29

Fast Merge Tree Computation via SYCL

A merge tree is a topological descriptor of a real-valued function. Merge trees are used in visualization and topological data analysis, either directly or as a means to another end: computing a 0-dimensional persistence diagram, identifying connected components, performing topological simplification, etc. Scientific computing relies more and more on GPUs to achieve fast, scalable computation. […]
Jan, 22

Efficient OpenCL system integration of non-blocking FPGA accelerators

OpenCL functions as a portability layer for diverse heterogeneous hardware platforms including CPUs, GPUs, FPGAs, and hardware accelerators. However, OpenCL programs that use several of these devices in the same computing platform suffer from poor coordination between the OpenCL implementations of different hardware vendors. This paper proposes a vendor-independent open source method for integrating custom FPGA accelerators […]
Jan, 22

PIGEON: Optimizing CUDA Code Generator for End-to-End Training and Inference of Relational Graph Neural Networks

Relational graph neural networks (RGNNs) are graph neural networks (GNNs) with dedicated structures for modeling the different types of nodes and/or edges in heterogeneous graphs. While RGNNs have been increasingly adopted in many real-world applications due to their versatility and accuracy, they pose performance and system design challenges due to their inherent computation patterns, gap […]
Jan, 22

Analyzing Resource Utilization in an HPC System: A Case Study of NERSC’s Perlmutter

The resource demands of HPC applications vary significantly. However, it is common for HPC systems to assign resources on a per-node basis to prevent interference from co-located workloads. This gap between the coarse-grained resource allocation and the varying resource demands can lead to underutilization of HPC resources. In this study, we comprehensively analyzed the resource […]
Jan, 22

AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication

Recent advances in deep learning are based on growing model sizes and the necessary scaling of compute power. Training such large-scale models requires an intricate combination of data-, operator-, and pipeline parallelism in complex distributed systems. We show how to use OneFlow’s Split, Broadcast, and Partial Sum (SBP) tensor formulations to enable new distributed training methods […]
Jan, 22

PySAGES: flexible, advanced sampling methods accelerated with GPUs

Molecular dynamics simulations are a core element of research in physics, chemistry and biology. A key aspect for extending the capability of simulation tools is providing access to advanced sampling methods and techniques that permit calculation of the relevant, underlying free energy landscapes. In this sense, software tools that can be seamlessly adapted to a […]
Jan, 15

A Programming Model for GPU Load Balancing

We propose a GPU fine-grained load-balancing abstraction that decouples load balancing from work processing and aims to support both static and dynamic schedules with a programmable interface to implement new load-balancing schedules. Prior to our work, the only way to unleash the GPU’s potential on irregular problems has been to workload-balance through application-specific, tightly coupled […]
Jan, 15

Improving the scalability of modern applications by parallel multi-core and many-core programming

In recent years, the production and usage of vast graphs from different disciplines (social networks, geographical navigation, and internet routing, to name a few) has required fast and scalable algorithms. Reachability, single-source shortest path, partitioning, and coloring are some of the problems that are commonly applied to graphs. In this thesis, we focus on the problem […]
Jan, 15

Distributed Calculations with Algorithmic Skeletons for Heterogeneous Computing Environments

Contemporary HPC hardware typically provides several levels of parallelism, e.g. multiple nodes, each having multiple cores (possibly with vectorization) and accelerators. Efficiently programming such systems usually requires skills in combining several low-level frameworks such as MPI, OpenMP, and CUDA. This overburdens programmers without substantial parallel programming skills. One way to overcome this problem and to […]
Jan, 15

OpenMP Advisor

With the increasing diversity of heterogeneous architectures in the HPC industry, porting a legacy application to run on different architectures is a tough challenge. In this paper, we present OpenMP Advisor, a first-of-its-kind compiler tool that enables code offloading to a GPU with OpenMP using machine learning. Although the tool is currently […]
Jan, 15

Myths and Legends in High-Performance Computing

In this humorous and thought-provoking article, we discuss certain myths and legends that are folklore among members of the high-performance computing community. We collected those myths from conversations at conferences and meetings, product advertisements, papers, and other communications such as tweets, blogs, and news articles within (and beyond) our community. We believe they represent […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
