Apr, 7

Using Intel oneAPI for Multi-hybrid Acceleration Programming with GPU and FPGA Coupling

Intel oneAPI is a programming framework that supports various accelerators such as GPUs, FPGAs, and multi-core CPUs, with a focus on HPC applications. Users can apply code written in a single language, DPC++, across this heterogeneous programming environment. In practice, however, it is not easy to target different accelerators, especially non-Intel devices […]
Mar, 24

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

While polyhedral compilers have shown success in implementing advanced code transformations, they still have challenges in selecting the most profitable transformations that lead to the best speedups. This has motivated the use of machine learning to build cost models to guide the search for polyhedral optimizations. State-of-the-art polyhedral compilers have demonstrated a viable proof-of-concept of […]
Mar, 24

Full-Scale File System Acceleration on GPU

Modern HPC and AI Computing solutions regularly use GPUs as their main source of computational power. This creates a significant imbalance for storage operations for GPU applications, as every such storage operation has to be signalled to and handled by the CPU. In GPU4FS, we propose a radical solution to this imbalance: Move the file […]
Mar, 24

Retargeting and Respecializing GPU Workloads for Performance Portability

To come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that accounts for the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance […]
Mar, 24

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs

OpenMC is an open source Monte Carlo neutral particle transport application that has recently been ported to GPU using the OpenMP target offloading model. We examine the performance of OpenMC at scale on the Frontier, Polaris, and Aurora supercomputers, demonstrating that performance portability has been achieved by OpenMC across all three major GPU vendors (AMD, […]
Mar, 24

Parallel Gaussian process with kernel approximation in CUDA

This paper introduces a parallel implementation in CUDA/C++ of the Gaussian process with a decomposed kernel. This recent formulation, introduced by Joukov and Kulić (2022), is characterized by an approximated, but much smaller, matrix to be inverted compared to a plain Gaussian process. However, it exhibits a limitation when dealing with higher-dimensional samples which […]
Mar, 18

Fast Truncated SVD of Sparse and Dense Matrices on Graphics Processors

We investigate the solution of low-rank matrix approximation problems using the truncated SVD. For this purpose, we develop and optimize GPU implementations for the randomized SVD and a blocked variant of the Lanczos approach. Our work takes advantage of the fact that the two methods are composed of very similar linear algebra building blocks, which […]
Mar, 18

MUPPET: Optimizing Performance in OpenMP via Mutation Testing

Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level, which prevents their portability across compilers. This paper describes Muppet, a new approach that identifies program modifications, called mutations, aimed at improving program performance. Muppet’s mutations […]
Mar, 18

SYCL in the edge: performance and energy evaluation for heterogeneous acceleration

Edge computing is essential to handle increasing data volumes and processing capacities. It provides real-time and secure data processing near data sources, like smart devices, alleviating cloud computing energy use and saving network bandwidth. Specialized accelerators, like GPUs and FPGAs, are vital for low-latency edge computing but the requirement to customize code for different hardware […]
Mar, 18

Cost-Effective Methodology for Complex Tuning Searches in HPC: Navigating Interdependencies and Dimensionality

Tuning searches are pivotal in High-Performance Computing (HPC), addressing complex optimization challenges in computational applications. The complexity arises not only from finely tuning parameters within routines but also from potential interdependencies among them, rendering traditional optimization methods inefficient. Instead of scrutinizing interdependencies among parameters and routines, practitioners often face the dilemma of conducting independent tuning searches […]
Mar, 18

Predicting GPUDirect Benefits for HPC Workloads

Graphics processing units (GPUs) are becoming increasingly popular in modern HPC systems. Hardware for data movement to and from GPUs such as NVLink and GPUDirect has reduced latencies, increased throughput, and eliminated redundant copies. In this work, we use discrete event simulations to explore the impact of different communication paradigms on the messaging performance of […]
Mar, 10

Hybrid quantum programming with PennyLane Lightning on HPC platforms

We introduce PennyLane’s Lightning suite, a collection of high-performance state-vector simulators targeting CPU, GPU, and HPC-native architectures and workloads. Quantum applications such as QAOA, VQE, and synthetic workloads are implemented to demonstrate the supported classical computing architectures and showcase the scale of problems that can be simulated using our tooling. We benchmark the performance of […]

