high performance computing on graphics processing units: hgpu.org

Posts

Sep, 22

Collection skeletons: declarative abstractions for data collections

Modern programming languages provide programmers with rich abstractions for data collections as part of their standard libraries, e.g., Containers in the C++ STL, the Java Collections Framework, or the Scala Collections API. Typically, these collections frameworks are organised as hierarchies that provide programmers with common abstract data types (ADTs) like lists, queues, and stacks. While […]

OpenCL

Sep, 22

RenderKernel: High-level programming for real-time rendering systems

Real-time rendering applications leverage heterogeneous computing to optimize performance. However, software development across multiple devices presents challenges, including data layout inconsistencies, synchronization issues, resource management complexities, and architectural disparities. Additionally, the creation of such systems requires verbose and unsafe programming models. Recent developments in domain-specific and unified shading languages aim to mitigate these issues. Yet, […]

Sep, 22

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2) incorporating powerful parallel computing devices such as GPUs, FPGAs, and other accelerators; and 3) utilizing special parallel architectures like Single Instruction/Multiple Data (SIMD). Many researchers […]

CUDA

Sep, 15

Optimal Workload Placement on Multi-Instance GPUs

There is an urgent and pressing need to optimize usage of Graphical Processing Units (GPUs), which have arguably become one of the most expensive and sought after IT resources. To help with this goal, several of the current generation of GPUs support a partitioning feature, called Multi-Instance GPU (MIG) to allow multiple workloads to share […]

Sep, 15

Enhancing Code Portability, Problem Scale, and Storage Efficiency in Exascale Applicationsin Exascale Applications

The growing diversity of hardware and software stacks adds additional development challenges to high-performance software as we move to exascale systems. Re- engineering software for each new platform is no longer practical due to increasing heterogeneity. Hardware designers are prioritizing AI/ML features like reduced precision that increase performance but sacrifice accuracy. The growing scale of […]

CUDA

Sep, 15

Refining HPCToolkit for application performance analysis at exascale

As part of the US Department of Energy’s Exascale Computing Project (ECP), Rice University has been refining its HPCToolkit performance tools to better support measurement and analysis of applications executing on exascale supercomputers. To efficiently collect performance measurements of GPU-accelerated applications, HPCToolkit employs novel non-blocking data structures to communicate performance measurements between tool threads and […]

CUDA

•

OpenCL

Sep, 15

DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL

We present an assignment for a full Parallel Computing course. Since 2017/2018, we have proposed a different problem each academic year to illustrate various methodologies for approaching the same computational problem using different parallel programming models. They are designed to be parallelized using shared-memory programming with OpenMP, distributed-memory programming with MPI, and GPU programming with […]

CUDA

•

OpenCL

Sep, 15

Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee

Currently, the Weather Research and Forecasting model (WRF) utilizes shared memory (OpenMP) and distributed memory (MPI) parallelisms. To take advantage of GPU resources on the Perlmutter supercomputer at NERSC, we port parts of the computationally expensive routines of the Fast Spectral Bin Microphysics (FSBM) microphysical scheme to NVIDIA GPUs using OpenMP device offloading directives. To […]

CUDA

Sep, 1

Owl: Differential-based Side-Channel Leakage Detection for CUDA Applications

Over the past decade, various methods for detecting side-channel leakage have been proposed and proven to be effective against CPU side-channel attacks. These methods are valuable in assisting developers to identify and patch side-channel vulnerabilities. Nevertheless, recent research has revealed the feasibility of exploiting side-channel vulnerabilities to steal sensitive information from GPU applications, which are […]

CUDA

Sep, 1

VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing

The rapid advancement of Artificial Intelligence (AI) necessitates significant enhancements in the energy efficiency of Graphics Processing Units (GPUs) for Deep Neural Network (DNN) workloads. Such a challenge is particularly critical for embedded GPUs, which operate within stringent power constraints. Traditional GPU architectures, designed to support a limited set of numeric formats, face challenges in […]

CUDA

Sep, 1

Exploring Scalability in C++ Parallel STL Implementations

Since the advent of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Given multiple existing implementations of the parallel algorithms, a systematic, quantitative performance comparison is essential for choosing the appropriate implementation for a particular hardware configuration. In this work, we introduce a […]

CUDA

Sep, 1

A Parallel Compression Pipeline for Improving GPU Virtualization Data Transfers

GPUs are commonly used to accelerate the execution of applications in domains such as deep learning. Deep learning applications are applied to an increasing variety of scenarios, with edge computing being one of them. However, edge devices present severe computing power and energy limitations. In this context, the use of remote GPU virtualization solutions is […]

CUDA

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Collection skeletons: declarative abstractions for data collections

RenderKernel: High-level programming for real-time rendering systems

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Optimal Workload Placement on Multi-Instance GPUs

Enhancing Code Portability, Problem Scale, and Storage Efficiency in Exascale Applicationsin Exascale Applications

Refining HPCToolkit for application performance analysis at exascale

DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL

Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee

Owl: Differential-based Side-Channel Leakage Detection for CUDA Applications

VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing

Exploring Scalability in C++ Parallel STL Implementations

A Parallel Compression Pipeline for Improving GPU Virtualization Data Transfers

Recent source codes

HPC Benchmark Survey

HDM: Home made Diffusion Models

General Matrix Multiplication (GEMM)

CrossTL: Universal Programming Language & Translator

TBD-GPU

DG-SWEM - The Discontinuous Galerkin Shallow Water Equation Model

torchPDLP: Primal-Dual Linear Programming in PyTorch. In collaboration with AMD and IPAM

Benchmarks for Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs

kvcached: Elastic KV cache for dynamic GPU sharing and efficient multi-LLM inference

Bandicoot: C++ library for GPU accelerated linear algebra

Most viewed papers (last 30 days)