Posts
Mar, 24
LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers
While polyhedral compilers have shown success in implementing advanced code transformations, they still struggle to select the most profitable transformations that lead to the best speedups. This has motivated the use of machine learning to build cost models to guide the search for polyhedral optimizations. State-of-the-art polyhedral compilers have demonstrated a viable proof-of-concept of […]
Mar, 24
Full-Scale File System Acceleration on GPU
Modern HPC and AI Computing solutions regularly use GPUs as their main source of computational power. This creates a significant imbalance for storage operations for GPU applications, as every such storage operation has to be signalled to and handled by the CPU. In GPU4FS, we propose a radical solution to this imbalance: Move the file […]
Mar, 24
Retargeting and Respecializing GPU Workloads for Performance Portability
To come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that accounts for the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance […]
Mar, 24
Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs
OpenMC is an open source Monte Carlo neutral particle transport application that has recently been ported to GPU using the OpenMP target offloading model. We examine the performance of OpenMC at scale on the Frontier, Polaris, and Aurora supercomputers, demonstrating that performance portability has been achieved by OpenMC across all three major GPU vendors (AMD, […]
Mar, 24
Parallel Gaussian process with kernel approximation in CUDA
This paper introduces a parallel implementation in CUDA/C++ of the Gaussian process with a decomposed kernel. This recent formulation, introduced by Joukov and Kulić (2022), is characterized by an approximated, but much smaller, matrix to be inverted compared to a plain Gaussian process. However, it exhibits a limitation when dealing with higher-dimensional samples which […]
Mar, 18
Fast Truncated SVD of Sparse and Dense Matrices on Graphics Processors
We investigate the solution of low-rank matrix approximation problems using the truncated SVD. For this purpose, we develop and optimize GPU implementations for the randomized SVD and a blocked variant of the Lanczos approach. Our work takes advantage of the fact that the two methods are composed of very similar linear algebra building blocks, which […]
Mar, 18
MUPPET: Optimizing Performance in OpenMP via Mutation Testing
Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level, thus preventing their portability across compilers. This paper describes Muppet, a new approach that identifies program modifications called mutations aimed at improving program performance. Muppet’s mutations […]
Mar, 18
SYCL in the edge: performance and energy evaluation for heterogeneous acceleration
Edge computing is essential to handle increasing data volumes and processing capacities. It provides real-time and secure data processing near data sources, like smart devices, alleviating cloud computing energy use and saving network bandwidth. Specialized accelerators, like GPUs and FPGAs, are vital for low-latency edge computing, but the requirements to customize code for different hardware […]
Mar, 18
Cost-Effective Methodology for Complex Tuning Searches in HPC: Navigating Interdependencies and Dimensionality
Tuning searches are pivotal in High-Performance Computing (HPC), addressing complex optimization challenges in computational applications. The complexity arises not only from finely tuning parameters within routines but also from potential interdependencies among them, rendering traditional optimization methods inefficient. Instead of scrutinizing interdependencies among parameters and routines, practitioners often face the dilemma of conducting independent tuning searches […]
Mar, 18
Predicting GPUDirect Benefits for HPC Workloads
Graphics processing units (GPUs) are becoming increasingly popular in modern HPC systems. Hardware for data movement to and from GPUs such as NVLink and GPUDirect has reduced latencies, increased throughput, and eliminated redundant copies. In this work, we use discrete event simulations to explore the impact of different communication paradigms on the messaging performance of […]
Mar, 10
FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators
NVIDIA Tensor Cores and AMD Matrix Cores (together called Matrix Accelerators) are of growing interest in high-performance computing and machine learning owing to their high performance. Unfortunately, their numerical behaviors are not publicly documented, including the number of extra precision bits maintained, the accumulation order of addition, and predictable subnormal number handling during computations. This […]
Mar, 10
Distributed OpenMP Offloading of OpenMC on Intel GPU MAX Accelerators
Monte Carlo (MC) simulations play a pivotal role in diverse scientific and engineering domains, with applications ranging from nuclear physics to materials science. Harnessing the computational power of high-performance computing (HPC) systems, especially Graphics Processing Units (GPUs), has become essential for accelerating MC simulations. This paper focuses on the adaptation and optimization of the OpenMC […]