
Posts

Mar, 3

Parallel programming in mobile devices with FancyJCL

Mobile devices and handheld systems, such as the now-ubiquitous smartphones and tablets, are becoming increasingly powerful. Their basic hardware configuration is usually a state-of-the-art heterogeneous architecture consisting of multi-core processors and some kind of accelerator, such as a GPU or DSP. Code specifically adapted to the architecture is mandatory if high-performance computation is required, and low-level […]
Mar, 3

Low-Overhead Trace Collection and Profiling on GPU Compute Kernels

While GPUs can bring substantial speedup to compute-intensive tasks, their programming is notoriously hard. From their programming model to microarchitectural particularities, the programmer may encounter many pitfalls that hinder performance in obscure ways. Numerous performance analysis tools provide helpful data on the efficiency of the compute kernels, but few allow the programmer to efficiently […]
Feb, 25

APPy: Annotated Parallelism for Python on GPUs

GPUs are increasingly being used to speed up Python applications in the scientific computing and machine learning domains. Currently, the two common approaches to leveraging GPU acceleration in Python are 1) creating a custom native GPU kernel and importing it as a function that can be called from Python; 2) using libraries such as […]
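
As a concrete illustration of approach 1, the native side can be a single exported function that Python loads via ctypes. The sketch below is ours, not APPy's API; the function name, file names, and build line are hypothetical.

    // saxpy.cpp -- a native "kernel" with C linkage so Python can load it.
    // Build, for example: g++ -O2 -shared -fPIC saxpy.cpp -o libsaxpy.so
    #include <cstddef>

    extern "C" void saxpy(float a, const float* x, float* y, std::size_t n) {
        // y <- a*x + y; a real GPU build would launch this body as a device
        // kernel instead of running a host loop.
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

From Python, ctypes.CDLL("./libsaxpy.so").saxpy(...) then calls the function directly; approach 2 trades this build-and-import boilerplate for library calls.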
Feb, 25

Analyzing GPU Performance in Virtualized Environments: A Case Study

The graphics processing unit (GPU) plays a crucial role in boosting application performance and enhancing computational tasks. Thanks to its parallel architecture and energy efficiency, the GPU has become essential in many computing scenarios. Moreover, the advent of GPU virtualization has been a significant breakthrough, as it provides scalable and adaptable GPU […]
Feb, 25

Assessing opportunities of SYCL for biological sequence alignment on GPU-based systems

Bioinformatics and computational biology are two fields that have been exploiting GPUs for more than two decades, with CUDA being the most widely used programming language for them. However, as CUDA is an NVIDIA proprietary language, it imposes a strong portability restriction across a wide range of heterogeneous architectures, such as AMD or Intel GPUs. To face […]
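
SYCL's appeal here is that one standard C++ source can target NVIDIA, AMD, and Intel devices. A minimal SYCL 2020 vector-add sketch (our illustration, not the paper's alignment kernels) shows the model:

    // vadd.cpp -- the same source compiles for NVIDIA, AMD or Intel GPUs
    // given a SYCL 2020 compiler (e.g. DPC++ or AdaptiveCpp).
    #include <sycl/sycl.hpp>
    #include <cstdio>
    #include <vector>

    int main() {
        constexpr size_t n = 1024;
        std::vector<int> a(n, 1), b(n, 2), c(n, 0);
        sycl::queue q;                                   // default device
        {
            sycl::buffer<int> A{a.data(), sycl::range<1>{n}};
            sycl::buffer<int> B{b.data(), sycl::range<1>{n}};
            sycl::buffer<int> C{c.data(), sycl::range<1>{n}};
            q.submit([&](sycl::handler& h) {
                sycl::accessor ra{A, h, sycl::read_only};
                sycl::accessor rb{B, h, sycl::read_only};
                sycl::accessor wc{C, h, sycl::write_only};
                h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
                    wc[i] = ra[i] + rb[i];               // one work-item each
                });
            });
        }                                                // buffers copy back
        std::printf("c[0] = %d\n", c[0]);                // prints 3
    }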
Feb, 25

Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures

Deep Learning (DL) frameworks such as PyTorch and TensorFlow include runtime infrastructures responsible for executing trained models on target hardware, managing memory, data transfers, and, if applicable, multi-accelerator execution. Additionally, it is common practice to deploy pre-trained models in environments distinct from their native development settings. This has led to the introduction of interchange formats […]
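
On the measurement side, Linux exposes package-level energy counters through the powercap (RAPL) sysfs interface. The sketch below assumes an Intel CPU with an intel-rapl:0 domain; it is our illustration, not the study's harness.

    // Read the cumulative package energy (microjoules) before and after a
    // workload. Real code must also handle counter wrap-around.
    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <thread>

    long long read_uj(const char* path) {
        std::ifstream f(path);
        long long uj = -1;
        f >> uj;
        return uj;
    }

    int main() {
        const char* rapl = "/sys/class/powercap/intel-rapl:0/energy_uj";
        long long before = read_uj(rapl);
        std::this_thread::sleep_for(std::chrono::seconds(1)); // run model here
        long long after = read_uj(rapl);
        if (before >= 0 && after >= before)
            std::printf("package energy: %.3f J\n", (after - before) / 1e6);
    }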
Feb, 25

Benchmarking and Dissecting the Nvidia Hopper GPU Architecture

Graphics processing units (GPUs) are continually evolving to cater to the computational demands of contemporary general-purpose workloads, particularly those driven by artificial intelligence (AI) utilizing deep learning techniques. A substantial body of studies has been dedicated to dissecting the microarchitectural metrics characterizing diverse GPU generations, which helps researchers understand the hardware details and leverage them […]
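
The dissection methodology behind such studies is microbenchmarking: timing long chains of dependent operations so that latency cannot be hidden. The host-side pointer chase below illustrates the technique (GPU papers apply the same idea inside kernels using device clock registers); it is a sketch, not the paper's harness.

    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const size_t n = 1 << 22;            // ~4M nodes, exceeds most caches
        std::vector<size_t> next(n);
        std::iota(next.begin(), next.end(), 0);
        std::mt19937_64 rng{42};
        // Sattolo's algorithm: a single-cycle permutation, so the chase
        // visits every node and cannot get stuck in a short loop.
        for (size_t k = n - 1; k > 0; --k) {
            std::uniform_int_distribution<size_t> d(0, k - 1);
            std::swap(next[k], next[d(rng)]);
        }
        size_t i = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t step = 0; step < n; ++step)
            i = next[i];                     // each load depends on the last
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
        std::printf("avg dependent-load latency: %.1f ns (i=%zu)\n", ns, i);
    }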
Feb, 18

pSTL-Bench: A Micro-Benchmark Suite for Assessing Scalability of C++ Parallel STL Implementations

Since the advent of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Given multiple existing implementations of the parallel algorithms, a systematic, quantitative performance comparison is essential for choosing the appropriate implementation for a particular hardware configuration. In this work, we introduce a […]
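
The object under test is easy to picture: the same STL algorithm call, redirected to a parallel backend by an execution policy. A minimal C++17 example (with libstdc++, link TBB, e.g. -ltbb):

    #include <algorithm>
    #include <cstdio>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> v(1 << 24, 1.5);
        // std::execution::par hands the work to the implementation's backend
        // (TBB, OpenMP, oneDPL, ...); how well each one scales is precisely
        // what a micro-benchmark suite quantifies.
        std::transform(std::execution::par, v.begin(), v.end(), v.begin(),
                       [](double x) { return x * x; });
        double sum = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
        std::printf("sum = %f\n", sum);
    }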
Feb, 18

TransAxx: Efficient Transformers with Approximate Computing

Vision Transformer (ViT) models, built on the recently introduced transformer architecture, have proven to be very competitive and have become a popular alternative to Convolutional Neural Networks (CNNs). However, the high computational requirements of these models limit their practical applicability, especially on low-power devices. The current state of the art employs approximate multipliers to address the highly increased […]
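
To make "approximate multiplier" concrete: such designs shrink the multiplier's partial-product array by ignoring low-order bits, trading accuracy for area and energy. A generic software emulation (our illustration, not the TransAxx multipliers):

    #include <cstdint>
    #include <cstdio>

    // Hypothetical approximate multiplier: keep only the top `bits` bits of
    // each 8-bit operand, as if the hardware omitted the low partial products.
    int32_t approx_mul(int8_t a, int8_t b, int bits = 6) {
        int shift = 8 - bits;
        int32_t ta = (a >> shift) * (1 << shift);   // truncate low-order bits
        int32_t tb = (b >> shift) * (1 << shift);
        return ta * tb;
    }

    int main() {
        int8_t a = 93, b = -57;
        std::printf("exact=%d approx=%d\n", a * b, approx_mul(a, b));
    }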
Feb, 18

Graphtoy: Fast Software Simulation of Applications for AMD’s AI Engines

This work presents Graphtoy, a coroutine-based compute graph simulator built in C++20, which can be embedded into a target application for rapid step-by-step prototyping of graphs targeting AMD’s AI Engines, as used in Versal FPGAs and Ryzen 7040 CPUs. By using a molecular docking application as a case study, we demonstrate: 1) how compute graphs […]
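
C++20 coroutines let each graph node be written as a resumable function that yields to a scheduler between simulation steps. The toy below captures that structure; it is our sketch, not Graphtoy's actual API.

    // Compile with -std=c++20. Two "nodes" advance in lockstep under a
    // round-robin scheduler, one cycle per resume.
    #include <coroutine>
    #include <cstdio>

    struct Step {
        struct promise_type {
            Step get_return_object() {
                return {std::coroutine_handle<promise_type>::from_promise(*this)};
            }
            std::suspend_always initial_suspend() { return {}; }
            std::suspend_always final_suspend() noexcept { return {}; }
            void return_void() {}
            void unhandled_exception() {}
        };
        std::coroutine_handle<promise_type> h;
        ~Step() { if (h) h.destroy(); }
    };

    Step kernel(const char* name) {
        for (int cycle = 0; cycle < 3; ++cycle) {
            std::printf("%s: cycle %d\n", name, cycle);
            co_await std::suspend_always{};  // hand control back to scheduler
        }
    }

    int main() {
        Step a = kernel("node-A"), b = kernel("node-B");
        while (!a.h.done() || !b.h.done()) { // round-robin scheduling
            if (!a.h.done()) a.h.resume();
            if (!b.h.done()) b.h.resume();
        }
    }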
Feb, 18

An Evaluative Comparison of Performance Portability across GPU Programming Models

Ensuring high productivity in scientific software development necessitates developing and maintaining a single codebase that can run efficiently on a range of accelerator-based supercomputing platforms. While prior work has investigated the performance portability of a few selected proxy applications or programming models, this paper provides a comprehensive study of a range of proxy applications implemented […]
Feb, 18

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

We introduce QUICK, a group of novel optimized CUDA kernels for the efficient inference of quantized Large Language Models (LLMs). QUICK addresses the shared memory bank-conflict problem of state-of-the-art mixed precision matrix multiplication kernels. Our method interleaves the quantized weight matrices of LLMs offline to skip the shared memory write-back after the dequantization. We demonstrate […]
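
The offline interleaving idea can be sketched host-side: reorder the packed weights so that values consumed together by neighboring threads land at consecutive addresses. The layout below is illustrative, not the paper's exact pattern.

    #include <cstdint>
    #include <vector>

    // Reorder a rows x cols matrix of packed quantized weights (bytes here,
    // for clarity) so each block of `lanes` rows is stored column-major;
    // assumes rows is a multiple of lanes.
    std::vector<uint8_t> interleave(const std::vector<uint8_t>& w,
                                    int rows, int cols, int lanes = 8) {
        std::vector<uint8_t> out(w.size());
        size_t idx = 0;
        for (int r0 = 0; r0 < rows; r0 += lanes)   // block of `lanes` rows
            for (int c = 0; c < cols; ++c)         // columns first...
                for (int l = 0; l < lanes; ++l)    // ...then rows in block
                    out[idx++] = w[size_t(r0 + l) * cols + c];
        return out;
    }

With this layout, the lanes of a warp fragment read one contiguous run per column instead of `lanes` strided locations, which is the kind of access pattern that avoids shared-memory bank conflicts.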

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
