
Posts

Aug 8

ndzip-gpu: Efficient Lossless Compression of Scientific Floating-Point Data on GPUs

Lossless data compression is a promising software approach for reducing the bandwidth requirements of scientific applications on accelerator clusters without introducing approximation errors. Suitable compressors must be able to effectively compact floating-point data while saturating the system interconnect to avoid introducing unnecessary latencies. We present ndzip-gpu, a novel, highly efficient GPU parallelization scheme for the block […]
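
The building block shared by many lossless floating-point compressors, including this family, is a residual transform that turns smooth data into bit patterns with long runs of leading zeros. The sketch below is a minimal CUDA illustration of that idea (XOR each value's bit pattern with its predecessor and record the leading-zero count); it is not ndzip's actual pipeline, which uses a multi-dimensional integer Lorenzo transform and a block-based coder.

    #include <cstdint>
    #include <cuda_runtime.h>

    // Hedged sketch: XOR-with-predecessor residual transform, a common first
    // stage in lossless floating-point compressors. Not ndzip's actual scheme;
    // it only shows why residuals of smooth data have many leading zero bits
    // that a later coding stage can drop.
    __global__ void xor_residuals(const float *in, uint32_t *residual,
                                  int *leading_zeros, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        uint32_t cur  = __float_as_uint(in[i]);
        uint32_t prev = (i == 0) ? 0u : __float_as_uint(in[i - 1]);
        uint32_t r    = cur ^ prev;          // small for smooth data
        residual[i]      = r;
        leading_zeros[i] = __clz(r);         // bits a coder could omit
    }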
Aug 8

Performance assessment of CUDA and OpenACC in large scale combustion simulations

GPUs have climbed to the top of supercomputer systems, making life harder for many legacy scientific codes. Nowadays, many recipes are used to port such codes, with little clarity about which is the best option. We present a comparative analysis of the two most common approaches, CUDA and OpenACC, in the multi-physics CFD […]
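
To make the comparison concrete, here is the same SAXPY loop in both models; a minimal, hedged illustration of the programming-model difference, not code from the paper (which ports a multi-physics CFD solver).

    // Minimal SAXPY in both models (illustrative only).
    // CUDA: explicit kernel and explicit launch configuration.
    __global__ void saxpy_cuda(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    // launched as: saxpy_cuda<<<(n + 255) / 256, 256>>>(n, a, x, y);

    // OpenACC: the same loop annotated with a directive; the compiler
    // generates the kernel and the data movement.
    void saxpy_acc(int n, float a, const float *x, float *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }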
Aug 8

On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors

Heterogeneous microprocessors which integrate a CPU and GPU on a single chip provide low-overhead CPU-GPU communication and permit sharing of on-chip resources that a traditional discrete GPU would not have direct access to. These features allow for the optimization of codes that heretofore would be suitable only for multi-core CPUs or discrete GPUs to be […]
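
A hedged sketch of the kind of low-overhead sharing the abstract refers to: with a single managed allocation, CPU and GPU touch the same buffer without explicit copies. The calls are standard CUDA runtime API; on an integrated CPU-GPU chip the allocation is genuinely shared physical memory, while on a discrete GPU the runtime migrates pages on demand.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float f)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *data = nullptr;
        // One allocation visible to both CPU and GPU; no explicit cudaMemcpy.
        cudaMallocManaged(&data, n * sizeof(float));
        for (int i = 0; i < n; ++i) data[i] = 1.0f;     // CPU writes
        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // GPU updates in place
        cudaDeviceSynchronize();
        printf("data[0] = %f\n", data[0]);              // CPU reads result
        cudaFree(data);
        return 0;
    }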
Aug 8

ScaleHLS: Scalable High-Level Synthesis through MLIR

High-level Synthesis (HLS) has been widely adopted as it significantly improves the hardware design productivity and enables efficient design space exploration (DSE). HLS tools can be used to deliver solutions for many different kinds of design problems, which are often better solved with different levels of abstraction. While existing HLS tools are built using compiler […]
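
For readers unfamiliar with HLS, the design space mentioned above is largely spanned by directives such as pipelining, unrolling, and array partitioning. The sketch below annotates a dot product with Vitis-HLS-style pragmas to show the knobs a DSE engine tunes; ScaleHLS itself operates on MLIR rather than on pragma-annotated C, so this is an illustration of the problem, not of the tool.

    // Illustrative HLS kernel (Vitis-HLS-style pragmas, not ScaleHLS's MLIR
    // representation). DSE would sweep the unroll factor, initiation interval,
    // and partitioning to trade area against throughput.
    float dot(const float a[1024], const float b[1024])
    {
        #pragma HLS ARRAY_PARTITION variable=a cyclic factor=8
        #pragma HLS ARRAY_PARTITION variable=b cyclic factor=8
        float acc = 0.0f;
        for (int i = 0; i < 1024; ++i) {
            #pragma HLS PIPELINE II=1
            #pragma HLS UNROLL factor=8
            acc += a[i] * b[i];
        }
        return acc;
    }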
Aug 8

PoCL-R: A Scalable Low Latency Distributed OpenCL Runtime

Offloading the most demanding parts of applications to an edge GPU server cluster to save power or improve the result quality is a solution that becomes increasingly realistic with new networking technologies. In order to make such a computing scheme feasible, an application programming layer that can provide both low latency and scalable utilization of […]
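
Because PoCL-R sits behind the standard OpenCL API, an application enumerates remote devices exactly as it would local ones. The sketch below is generic OpenCL host code (C API), not anything PoCL-R-specific; which devices the platform exposes is a property of the runtime configuration, not of new API calls.

    #include <cstdio>
    #include <CL/cl.h>

    int main()
    {
        // Standard OpenCL discovery; with a distributed runtime the platform
        // can expose devices that physically live on an edge server cluster.
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);

        cl_device_id devices[16];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);

        cl_int err = CL_SUCCESS;
        cl_context ctx = clCreateContext(nullptr, num_devices, devices,
                                         nullptr, nullptr, &err);
        cl_command_queue q = clCreateCommandQueueWithProperties(
            ctx, devices[0], nullptr, &err);

        printf("found %u device(s)\n", num_devices);
        // ... build program, create kernels, enqueue work as usual ...
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return 0;
    }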
Jul 25

Effective GPU Sharing Under Compiler Guidance

Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive resource, and boosting utilization of GPUs without causing performance degradation of individual workloads is an important and […]
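
As a hedged illustration of the sharing problem: two independent kernels launched into separate CUDA streams can overlap on one GPU when neither saturates its SMs or memory bandwidth. The snippet only shows the manual mechanism; deciding automatically whether and how to co-locate workloads is what the compiler-guided approach addresses.

    #include <cuda_runtime.h>

    __global__ void workload_a(float *x, int n) { /* ... */ }
    __global__ void workload_b(float *y, int n) { /* ... */ }

    void colocate(float *x, float *y, int n)
    {
        // Separate streams let the kernels execute concurrently on one GPU
        // if enough resources are free; whether that helps or hurts each
        // workload is the placement decision being automated.
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        workload_a<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
        workload_b<<<(n + 255) / 256, 256, 0, s2>>>(y, n);
        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }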
Jul 25

Face.evoLVe: A High-Performance Face Recognition Library

In this paper, we develop face.evoLVe – a comprehensive library that collects and implements a wide range of popular deep learning-based methods for face recognition. First of all, face.evoLVe is composed of key components that cover the full process of face analytics, including face alignment, data processing, various backbones, losses, and alternatives with bags of […]
Jul 25

StreamBlocks: A compiler for heterogeneous dataflow computing

To increase performance and efficiency, systems use FPGAs as reconfigurable accelerators. A key challenge in designing these systems is partitioning computation between processors and an FPGA. An appropriate division of labor may be difficult to predict in advance and require experiments and measurements. When an investigation requires rewriting part of the system in a new […]
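
StreamBlocks compiles CAL dataflow programs; as a hedged, language-shifted illustration of the model it targets, here is a toy two-actor pipeline in plain C++ where actors communicate only through a FIFO. Partitioning then amounts to deciding which actors run on the processor and which are synthesized to the FPGA, without rewriting the actors themselves.

    #include <cstdio>
    #include <queue>

    // Toy dataflow pipeline: actors exchange tokens only through FIFOs, so
    // each actor can be mapped to a CPU thread or an FPGA pipeline stage
    // independently. Illustrative C++, not CAL or StreamBlocks output.
    struct Fifo { std::queue<int> q; };

    void producer(Fifo &out, int n)            // actor 1: emits tokens
    {
        for (int i = 0; i < n; ++i) out.q.push(i * i);
    }

    void consumer(Fifo &in)                    // actor 2: consumes tokens
    {
        while (!in.q.empty()) {
            printf("%d\n", in.q.front());
            in.q.pop();
        }
    }

    int main()
    {
        Fifo f;
        producer(f, 8);
        consumer(f);
        return 0;
    }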
Jul 25

A method for decompilation of AMD GCN kernels to OpenCL

Introduction: Decompilers are useful tools for software analysis and support in the absence of source code. They are available for many hardware architectures and programming languages. However, none of the existing decompilers support modern AMD GPU architectures such as AMD GCN and RDNA. Purpose: We aim to develop the first assembly decompiler tool for a […]
Jul 25

DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors

Machine Learning (ML) functions are becoming ubiquitous in latency- and privacy-sensitive IoT applications, prompting a shift toward near-sensor processing at the extreme edge and the consequent increasing adoption of Parallel Ultra-Low Power (PULP) IoT processors. These compute- and memory-constrained parallel architectures need to efficiently run a wide range of algorithms, including key Non-Neural ML […]
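
As a hedged illustration of what parallelizing a non-neural ML kernel means on such a multi-core device, here is the assignment step of k-means split across cores with OpenMP. PULP clusters use their own fork-join runtime rather than plain OpenMP, so this shows the parallel structure of the workload, not the paper's implementation.

    #include <cfloat>

    // K-means assignment step: each point is labelled with its nearest
    // centroid. The loop over points is embarrassingly parallel, which is
    // exactly what a multi-core cluster exploits. OpenMP used purely for
    // illustration.
    void assign_clusters(const float *points, const float *centroids,
                         int *labels, int n_points, int n_clusters, int dim)
    {
        #pragma omp parallel for
        for (int p = 0; p < n_points; ++p) {
            float best = FLT_MAX;
            int best_k = 0;
            for (int k = 0; k < n_clusters; ++k) {
                float d = 0.0f;
                for (int j = 0; j < dim; ++j) {
                    float diff = points[p * dim + j] - centroids[k * dim + j];
                    d += diff * diff;
                }
                if (d < best) { best = d; best_k = k; }
            }
            labels[p] = best_k;
        }
    }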
Jul 18

OpenCL FPGA Optimization guided by memory accesses and roofline model analysis applied to tomography acceleration

Backward projection is one of the most time-consuming steps in model-based iterative reconstruction computed tomography. The 3D backprojection memory access pattern is potentially regular enough to efficiently exploit the computational power of acceleration boards based on GPUs or FPGAs. High-level tools such as HLS or OpenCL make it easier to take such particular memory accesses into account during the design […]
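
To see why the access pattern matters, consider a voxel-driven backprojection: each voxel accumulates contributions from every projection angle, and neighbouring voxels read neighbouring detector bins, which is the regularity that FPGA and GPU designs exploit. The sketch below is a generic, simplified parallel-beam CUDA kernel (the paper targets OpenCL on FPGA); names and geometry are illustrative.

    // Simplified voxel-driven backprojection (parallel beam, illustrative).
    // Neighbouring voxels hit neighbouring detector bins, so accesses are
    // regular enough to stream through on-chip memory on FPGA or to coalesce
    // on GPU. Real cone-beam geometry and weighting are omitted.
    __global__ void backproject(float *volume, const float *sino,
                                const float *cos_t, const float *sin_t,
                                int nx, int ny, int nz,
                                int n_angles, int n_bins)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z;
        if (x >= nx || y >= ny || z >= nz) return;

        float fx = x - nx / 2.0f, fy = y - ny / 2.0f;
        float acc = 0.0f;
        for (int a = 0; a < n_angles; ++a) {
            float s = fx * cos_t[a] + fy * sin_t[a];   // detector coordinate
            int bin = (int)(s + n_bins / 2.0f);
            if (bin >= 0 && bin < n_bins)
                acc += sino[(a * nz + z) * n_bins + bin];
        }
        volume[(z * ny + y) * nx + x] += acc;
    }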
Jul 18

Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis

The importance of security infrastructures for high-throughput networks has rapidly grown as a result of expanding internet traffic and increasingly high-bandwidth connections. Intrusion-detection systems (IDSs), such as SNORT, rely upon rule sets designed to alert system administrators of malicious packets. Methods for deep-packet inspection, which often depend upon regular-expression searches, can be accelerated on programmable-logic […]
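
The standard way to map regular expressions onto programmable logic is to keep an NFA's states as a one-hot bit vector and update all of them in parallel for each input byte, giving one character per clock once pipelined. The hedged sketch below shows that update for the toy pattern "ab*c" in HLS-style C++; it is a generic construction, not the paper's SNORT-derived rule set.

    #include <cstdint>

    // One-hot NFA for the toy pattern "ab*c": bit 0 = start (always live, so
    // the pattern is found anywhere in the stream), bit 1 = saw 'a' (b* loop),
    // bit 2 = accept. One input byte per update.
    uint8_t step(uint8_t state, char c)
    {
        uint8_t next = 0x1;                          // start state stays live
        if ((state & 0x1) && c == 'a') next |= 0x2;  // s0 -a-> s1
        if ((state & 0x2) && c == 'b') next |= 0x2;  // s1 -b-> s1
        if ((state & 0x2) && c == 'c') next |= 0x4;  // s1 -c-> accept
        return next;
    }

    bool match(const char *buf, int len)
    {
        uint8_t state = 0x1;
        for (int i = 0; i < len; ++i) {
            #pragma HLS PIPELINE II=1                // one character per cycle
            state = step(state, buf[i]);
            if (state & 0x4) return true;            // pattern seen
        }
        return false;
    }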

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: