29390

Posts

Sep, 1

VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing

The rapid advancement of Artificial Intelligence (AI) necessitates significant enhancements in the energy efficiency of Graphics Processing Units (GPUs) for Deep Neural Network (DNN) workloads. Such a challenge is particularly critical for embedded GPUs, which operate within stringent power constraints. Traditional GPU architectures, designed to support a limited set of numeric formats, face challenges in […]
Sep, 1

Exploring Scalability in C++ Parallel STL Implementations

Since the advent of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Given multiple existing implementations of the parallel algorithms, a systematic, quantitative performance comparison is essential for choosing the appropriate implementation for a particular hardware configuration. In this work, we introduce a […]
Sep, 1

A Parallel Compression Pipeline for Improving GPU Virtualization Data Transfers

GPUs are commonly used to accelerate the execution of applications in domains such as deep learning. Deep learning applications are applied to an increasing variety of scenarios, with edge computing being one of them. However, edge devices present severe computing power and energy limitations. In this context, the use of remote GPU virtualization solutions is […]
Sep, 1

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This […]
Aug, 25

Abstractions for C++ code optimizations in parallel high-performance applications

Many computational problems consider memory throughput a performance bottleneck, especially in the domain of parallel computing. Software needs to be attuned to hardware features like cache architectures or concurrent memory banks to reach a decent level of performance efficiency. This can be achieved by selecting the right memory layouts for data structures or changing the […]
Aug, 25

Double-Precision Floating-Point Data Visualizations Using Vulkan API

Proper representation of data in graphical visualizations becomes challenging when high accuracy in data types is required, especially in those situations where the difference between double-precision floating-point and single-precision floating-point values makes a significant difference. Some of the limitations of using single-precision over double-precision include lesser accuracy, which accumulates errors over time, and poor modeling […]
Aug, 25

Confidential Computing on Heterogeneous Systems: Survey and Implications

In recent years, the widespread informatization and rapid data explosion have increased the demand for high-performance heterogeneous systems that integrate multiple computing cores such as CPUs, Graphics Processing Units (GPUs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), and Neural Processing Units (NPUs). The combination of CPU and GPU is particularly popular due […]
Aug, 25

CI/CD Efforts for Validation, Verification and Benchmarking OpenMP Implementations

Software developers must adapt to keep up with the changing capabilities of platforms so that they can utilize the power of High- Performance Computers (HPC), including exascale systems. OpenMP, a directive-based parallel programming model, allows developers to include directives to existing C, C++, or Fortran code to allow node level parallelism without compromising performance. This […]
Aug, 25

Characterizing CUDA and OpenMP Synchronization Primitives

Over the last two decades, parallelism has become the primary method for speeding up computer programs. When writing parallel code, it is often necessary to use synchronization primitives (e.g., atomics, barriers, or critical sections) to enforce correctness. However, the performance of synchronization primitives depends on a variety of complex factors that non-experts may be unaware […]
Aug, 18

The VerCors Verifier: A Progress Report

This paper gives an overview of the most recent developments on the VerCors verifier. VerCors is a deductive verifier for concurrent software, written in multiple programming languages, where the specifications are written in terms of pre-/postcondition contracts using permission-based separation logic. In essence, VerCors is a program transformation tool: it translates an annotated program into […]
Aug, 18

Portability of Fortran’s ‘do concurrent’ on GPUs

There is a continuing interest in using standard language constructs for accelerated computing in order to avoid (sometimes vendor-specific) external APIs. For Fortran codes, the {tt do concurrent} (DC) loop has been successfully demonstrated on the NVIDIA platform. However, support for DC on other platforms has taken longer to implement. Recently, Intel has added DC […]
Aug, 18

HiCCL: A Hierarchical Collective Communication Library

HiCCL (Hierarchical Collective Communication Library) addresses the growing complexity and diversity in high-performance network architectures. As GPU systems have envolved into networks of GPUs with different multilevel communication hierarchies, optimizing each collective function for a specific system has become a challenging task. Consequently, many collective libraries struggle to adapt to different hardware and software, especially […]

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: