high performance computing on graphics processing units: hgpu.org

Posts

Sep, 22

RenderKernel: High-level programming for real-time rendering systems

Real-time rendering applications leverage heterogeneous computing to optimize performance. However, software development across multiple devices presents challenges, including data layout inconsistencies, synchronization issues, resource management complexities, and architectural disparities. Additionally, the creation of such systems requires verbose and unsafe programming models. Recent developments in domain-specific and unified shading languages aim to mitigate these issues. Yet, […]

Sep, 22

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2) incorporating powerful parallel computing devices such as GPUs, FPGAs, and other accelerators; and 3) utilizing special parallel architectures like Single Instruction/Multiple Data (SIMD). Many researchers […]

CUDA

Sep, 15

DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL

We present an assignment for a full Parallel Computing course. Since 2017/2018, we have proposed a different problem each academic year to illustrate various methodologies for approaching the same computational problem using different parallel programming models. They are designed to be parallelized using shared-memory programming with OpenMP, distributed-memory programming with MPI, and GPU programming with […]

CUDA

•

OpenCL

Sep, 15

Optimal Workload Placement on Multi-Instance GPUs

There is an urgent and pressing need to optimize usage of Graphical Processing Units (GPUs), which have arguably become one of the most expensive and sought after IT resources. To help with this goal, several of the current generation of GPUs support a partitioning feature, called Multi-Instance GPU (MIG) to allow multiple workloads to share […]

Sep, 15

Enhancing Code Portability, Problem Scale, and Storage Efficiency in Exascale Applicationsin Exascale Applications

The growing diversity of hardware and software stacks adds additional development challenges to high-performance software as we move to exascale systems. Re- engineering software for each new platform is no longer practical due to increasing heterogeneity. Hardware designers are prioritizing AI/ML features like reduced precision that increase performance but sacrifice accuracy. The growing scale of […]

CUDA

Sep, 15

Refining HPCToolkit for application performance analysis at exascale

As part of the US Department of Energy’s Exascale Computing Project (ECP), Rice University has been refining its HPCToolkit performance tools to better support measurement and analysis of applications executing on exascale supercomputers. To efficiently collect performance measurements of GPU-accelerated applications, HPCToolkit employs novel non-blocking data structures to communicate performance measurements between tool threads and […]

CUDA

•

OpenCL

Sep, 15

Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee

Currently, the Weather Research and Forecasting model (WRF) utilizes shared memory (OpenMP) and distributed memory (MPI) parallelisms. To take advantage of GPU resources on the Perlmutter supercomputer at NERSC, we port parts of the computationally expensive routines of the Fast Spectral Bin Microphysics (FSBM) microphysical scheme to NVIDIA GPUs using OpenMP device offloading directives. To […]

CUDA

Sep, 1

VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing

The rapid advancement of Artificial Intelligence (AI) necessitates significant enhancements in the energy efficiency of Graphics Processing Units (GPUs) for Deep Neural Network (DNN) workloads. Such a challenge is particularly critical for embedded GPUs, which operate within stringent power constraints. Traditional GPU architectures, designed to support a limited set of numeric formats, face challenges in […]

CUDA

Sep, 1

Exploring Scalability in C++ Parallel STL Implementations

Since the advent of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Given multiple existing implementations of the parallel algorithms, a systematic, quantitative performance comparison is essential for choosing the appropriate implementation for a particular hardware configuration. In this work, we introduce a […]

CUDA

Sep, 1

A Parallel Compression Pipeline for Improving GPU Virtualization Data Transfers

GPUs are commonly used to accelerate the execution of applications in domains such as deep learning. Deep learning applications are applied to an increasing variety of scenarios, with edge computing being one of them. However, edge devices present severe computing power and energy limitations. In this context, the use of remote GPU virtualization solutions is […]

CUDA

Sep, 1

Owl: Differential-based Side-Channel Leakage Detection for CUDA Applications

Over the past decade, various methods for detecting side-channel leakage have been proposed and proven to be effective against CPU side-channel attacks. These methods are valuable in assisting developers to identify and patch side-channel vulnerabilities. Nevertheless, recent research has revealed the feasibility of exploiting side-channel vulnerabilities to steal sensitive information from GPU applications, which are […]

CUDA

Sep, 1

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This […]

CUDA

high performance computing on graphics processing units: hgpu.org

Posts

RenderKernel: High-level programming for real-time rendering systems

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL

Optimal Workload Placement on Multi-Instance GPUs

Enhancing Code Portability, Problem Scale, and Storage Efficiency in Exascale Applicationsin Exascale Applications

Refining HPCToolkit for application performance analysis at exascale

Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee

VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing

Exploring Scalability in C++ Parallel STL Implementations

A Parallel Compression Pipeline for Improving GPU Virtualization Data Transfers

Owl: Differential-based Side-Channel Leakage Detection for CUDA Applications

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Most viewed papers (last 30 days)