high performance computing on graphics processing units: hgpu.org

Posts

Sep, 22

Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL

As the interest in FPGA-based accelerators for HPC applications increases, new challenges also arise, especially concerning different programming and portability issues. This paper aims to provide a snapshot of the current state of the FPGA tooling and its problems. To do so, we evaluate the performance portability of two frameworks for developing FPGA solutions for […]

OpenCL

Sep, 22

Collection skeletons: declarative abstractions for data collections

Modern programming languages provide programmers with rich abstractions for data collections as part of their standard libraries, e.g., Containers in the C++ STL, the Java Collections Framework, or the Scala Collections API. Typically, these collections frameworks are organised as hierarchies that provide programmers with common abstract data types (ADTs) like lists, queues, and stacks. While […]

OpenCL

Sep, 22

RenderKernel: High-level programming for real-time rendering systems

Real-time rendering applications leverage heterogeneous computing to optimize performance. However, software development across multiple devices presents challenges, including data layout inconsistencies, synchronization issues, resource management complexities, and architectural disparities. Additionally, the creation of such systems requires verbose and unsafe programming models. Recent developments in domain-specific and unified shading languages aim to mitigate these issues. Yet, […]

Sep, 22

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2) incorporating powerful parallel computing devices such as GPUs, FPGAs, and other accelerators; and 3) utilizing special parallel architectures like Single Instruction/Multiple Data (SIMD). Many researchers […]

CUDA

Sep, 15

DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL

We present an assignment for a full Parallel Computing course. Since 2017/2018, we have proposed a different problem each academic year to illustrate various methodologies for approaching the same computational problem using different parallel programming models. They are designed to be parallelized using shared-memory programming with OpenMP, distributed-memory programming with MPI, and GPU programming with […]

CUDA

•

OpenCL

Sep, 15

Optimal Workload Placement on Multi-Instance GPUs

There is an urgent and pressing need to optimize usage of Graphical Processing Units (GPUs), which have arguably become one of the most expensive and sought after IT resources. To help with this goal, several of the current generation of GPUs support a partitioning feature, called Multi-Instance GPU (MIG) to allow multiple workloads to share […]

Sep, 15

Enhancing Code Portability, Problem Scale, and Storage Efficiency in Exascale Applicationsin Exascale Applications

The growing diversity of hardware and software stacks adds additional development challenges to high-performance software as we move to exascale systems. Re- engineering software for each new platform is no longer practical due to increasing heterogeneity. Hardware designers are prioritizing AI/ML features like reduced precision that increase performance but sacrifice accuracy. The growing scale of […]

CUDA

Sep, 15

Refining HPCToolkit for application performance analysis at exascale

As part of the US Department of Energy’s Exascale Computing Project (ECP), Rice University has been refining its HPCToolkit performance tools to better support measurement and analysis of applications executing on exascale supercomputers. To efficiently collect performance measurements of GPU-accelerated applications, HPCToolkit employs novel non-blocking data structures to communicate performance measurements between tool threads and […]

CUDA

•

OpenCL

Sep, 15

Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee

Currently, the Weather Research and Forecasting model (WRF) utilizes shared memory (OpenMP) and distributed memory (MPI) parallelisms. To take advantage of GPU resources on the Perlmutter supercomputer at NERSC, we port parts of the computationally expensive routines of the Fast Spectral Bin Microphysics (FSBM) microphysical scheme to NVIDIA GPUs using OpenMP device offloading directives. To […]

CUDA

Sep, 1

Owl: Differential-based Side-Channel Leakage Detection for CUDA Applications

Over the past decade, various methods for detecting side-channel leakage have been proposed and proven to be effective against CPU side-channel attacks. These methods are valuable in assisting developers to identify and patch side-channel vulnerabilities. Nevertheless, recent research has revealed the feasibility of exploiting side-channel vulnerabilities to steal sensitive information from GPU applications, which are […]

CUDA

Sep, 1

VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing

The rapid advancement of Artificial Intelligence (AI) necessitates significant enhancements in the energy efficiency of Graphics Processing Units (GPUs) for Deep Neural Network (DNN) workloads. Such a challenge is particularly critical for embedded GPUs, which operate within stringent power constraints. Traditional GPU architectures, designed to support a limited set of numeric formats, face challenges in […]

CUDA

Sep, 1

Exploring Scalability in C++ Parallel STL Implementations

Since the advent of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Given multiple existing implementations of the parallel algorithms, a systematic, quantitative performance comparison is essential for choosing the appropriate implementation for a particular hardware configuration. In this work, we introduce a […]

CUDA

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL

Collection skeletons: declarative abstractions for data collections

RenderKernel: High-level programming for real-time rendering systems

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL

Optimal Workload Placement on Multi-Instance GPUs

Enhancing Code Portability, Problem Scale, and Storage Efficiency in Exascale Applicationsin Exascale Applications

Refining HPCToolkit for application performance analysis at exascale

Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee

Owl: Differential-based Side-Channel Leakage Detection for CUDA Applications

VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing

Exploring Scalability in C++ Parallel STL Implementations

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

Most viewed papers (last 30 days)