Posts
Aug, 25
Abstractions for C++ code optimizations in parallel high-performance applications
Many computational problems consider memory throughput a performance bottleneck, especially in the domain of parallel computing. Software needs to be attuned to hardware features like cache architectures or concurrent memory banks to reach a decent level of performance efficiency. This can be achieved by selecting the right memory layouts for data structures or changing the […]
Aug, 25
Double-Precision Floating-Point Data Visualizations Using Vulkan API
Proper representation of data in graphical visualizations becomes challenging when high accuracy in data types is required, especially in those situations where the difference between double-precision floating-point and single-precision floating-point values makes a significant difference. Some of the limitations of using single-precision over double-precision include lesser accuracy, which accumulates errors over time, and poor modeling […]
Aug, 25
Confidential Computing on Heterogeneous Systems: Survey and Implications
In recent years, the widespread informatization and rapid data explosion have increased the demand for high-performance heterogeneous systems that integrate multiple computing cores such as CPUs, Graphics Processing Units (GPUs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), and Neural Processing Units (NPUs). The combination of CPU and GPU is particularly popular due […]
Aug, 25
CI/CD Efforts for Validation, Verification and Benchmarking OpenMP Implementations
Software developers must adapt to keep up with the changing capabilities of platforms so that they can utilize the power of High- Performance Computers (HPC), including exascale systems. OpenMP, a directive-based parallel programming model, allows developers to include directives to existing C, C++, or Fortran code to allow node level parallelism without compromising performance. This […]
Aug, 25
Characterizing CUDA and OpenMP Synchronization Primitives
Over the last two decades, parallelism has become the primary method for speeding up computer programs. When writing parallel code, it is often necessary to use synchronization primitives (e.g., atomics, barriers, or critical sections) to enforce correctness. However, the performance of synchronization primitives depends on a variety of complex factors that non-experts may be unaware […]
Aug, 18
The VerCors Verifier: A Progress Report
This paper gives an overview of the most recent developments on the VerCors verifier. VerCors is a deductive verifier for concurrent software, written in multiple programming languages, where the specifications are written in terms of pre-/postcondition contracts using permission-based separation logic. In essence, VerCors is a program transformation tool: it translates an annotated program into […]
Aug, 18
Portability of Fortran’s ‘do concurrent’ on GPUs
There is a continuing interest in using standard language constructs for accelerated computing in order to avoid (sometimes vendor-specific) external APIs. For Fortran codes, the {tt do concurrent} (DC) loop has been successfully demonstrated on the NVIDIA platform. However, support for DC on other platforms has taken longer to implement. Recently, Intel has added DC […]
Aug, 18
HiCCL: A Hierarchical Collective Communication Library
HiCCL (Hierarchical Collective Communication Library) addresses the growing complexity and diversity in high-performance network architectures. As GPU systems have envolved into networks of GPUs with different multilevel communication hierarchies, optimizing each collective function for a specific system has become a challenging task. Consequently, many collective libraries struggle to adapt to different hardware and software, especially […]
Aug, 18
GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch
We present GRAFX, an open-source library designed for handling audio processing graphs in PyTorch. Along with various library functionalities, we describe technical details on the efficient parallel computation of input graphs, signals, and processor parameters in GPU. Then, we show its example use under a music mixing scenario, where parameters of every differentiable processor in […]
Aug, 18
Anatomizing Deep Learning Inference in Web Browsers
Web applications have increasingly adopted Deep Learning (DL) through in-browser inference, wherein DL inference performs directly within Web browsers. The actual performance of in-browser inference and its impacts on the quality of experience (QoE) remain unexplored, and urgently require new QoE measurements beyond traditional ones, e.g., mainly focusing on page load time. To bridge this […]
Aug, 14
HIPRT: A Ray Tracing Framework in HIP
We present HIPRT, an open-source ray tracing framework in HIP. HIPRT provides a versatile, cross-platform solution for professional rendering on contemporary many-core architectures. The core of the framework relies on the bounding volume hierarchy (BVH) with scalable construction algorithms and efficient ray traversal, employing hardware acceleration on AMD GPUs. From a user perspective, we aim […]
Aug, 14
A Comprehensive Deep Learning Library Benchmark and Optimal Library Selection
Deploying deep learning (DL) on mobile devices has been a notable trend in recent years. To support fast inference of on-device DL, DL libraries play a critical role as algorithms and hardware do. Unfortunately, no prior work ever dives deep into the ecosystem of modern DL libraries and provides quantitative results on their performance. In […]