
Posts

Apr, 23

Simple and efficient GPU accelerated topology optimisation: Codes and applications

This work presents topology optimisation implementations for linear elastic compliance minimisation in three dimensions, accelerated using Graphics Processing Units (GPUs). Three different open-source implementations are presented for linear problems, two of which are GPU-accelerated, using either OpenMP 4.5 or the Futhark language for the hardware acceleration. Both GPU implementations are based on high […]
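
For a flavour of the OpenMP 4.5 offload approach the abstract mentions, an element-wise density-update loop of the kind that appears in topology optimisation can be pushed to the GPU with a single target directive. The sketch below is illustrative only, assuming an optimality-criteria-style update; function and variable names are not taken from the released codes.

    // Minimal sketch of OpenMP 4.5 target offload (illustrative, not the
    // paper's code): an element-wise optimality-criteria density update.
    #include <cmath>
    #include <vector>

    void update_densities(std::vector<double>& x, const std::vector<double>& dc,
                          double lambda, double move) {
        double* xp = x.data();
        const double* dcp = dc.data();
        const std::size_t n = x.size();
        #pragma omp target teams distribute parallel for \
            map(tofrom: xp[0:n]) map(to: dcp[0:n])
        for (std::size_t e = 0; e < n; ++e) {
            // OC update rule; compliance sensitivities dc are negative.
            double step = xp[e] * std::sqrt(-dcp[e] / lambda);
            double lo = xp[e] - move, hi = xp[e] + move;   // move limits
            step = step < lo ? lo : (step > hi ? hi : step);
            xp[e] = step < 0.0 ? 0.0 : (step > 1.0 ? 1.0 : step); // box [0,1]
        }
    }
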
Apr, 23

Fuzzing Loop Optimizations in Compilers for C++ and Data-Parallel Languages

Compilers are part of the foundation upon which software systems are built; they need to be as correct as possible. This paper is about stress-testing loop optimizers; it presents a major reimplementation of Yet Another Random Program Generator (YARPGen), an open-source generative compiler fuzzer. This new version has found 122 bugs, both in compilers for […]
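
The core idea of generative compiler fuzzing can be condensed into a toy differential-testing loop: emit a random, UB-free, self-checking loop program, build it at two optimization levels, and flag any divergence. The harness below is a minimal sketch, not YARPGen; the compiler command (cc), flags, and file names are assumptions.

    // Toy differential loop fuzzing (not YARPGen itself): any output
    // mismatch between -O0 and -O3 points at a miscompile.
    #include <cstdio>
    #include <cstdlib>
    #include <random>

    int main() {
        std::mt19937 rng(std::random_device{}());
        std::uniform_int_distribution<int> bound(1, 64), stride(1, 4), coef(1, 9);
        int n = bound(rng), s = stride(rng), c = coef(rng);

        // Emit a random reduction loop with a checksum as its only output.
        FILE* f = std::fopen("case.c", "w");
        std::fprintf(f,
            "#include <stdio.h>\n"
            "int main(void){ unsigned long acc = 0;\n"
            "  for (int i = 0; i < %d; i += %d) acc += %du * (unsigned)i;\n"
            "  printf(\"%%lu\\n\", acc); return 0; }\n", n, s, c);
        std::fclose(f);

        std::system("cc -O0 case.c -o case_O0 && ./case_O0 > out0.txt");
        std::system("cc -O3 case.c -o case_O3 && ./case_O3 > out3.txt");
        return std::system("cmp -s out0.txt out3.txt") == 0 ? 0 : 1;
    }
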
Apr, 23

Thread-safe lattice Boltzmann for high-performance computing on GPUs

We present thread-safe, highly optimized lattice Boltzmann (LB) implementations, specifically aimed at exploiting the high memory bandwidth of GPU-based architectures. In contrast to standard approaches to LB coding, the proposed strategy, based on the reconstruction of the post-collision distribution via Hermite projection, enforces data locality and avoids the onset of memory dependencies, which may arise during the […]
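
For context, a standard way to obtain a race-free LB update is the two-lattice "pull" scheme, where each site reads its neighbours' populations and writes only its own, so every site is independent. The D2Q9 BGK sketch below illustrates that baseline only; it is not the paper's Hermite-projection reconstruction, and the data layout is an arbitrary choice.

    // Generic two-lattice "pull" collide-and-stream step for D2Q9 BGK.
    #include <vector>

    constexpr int Q = 9;
    constexpr int cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    constexpr int cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
    constexpr double w[Q] = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                             1.0/36, 1.0/36, 1.0/36, 1.0/36};

    void step(const std::vector<double>& fin, std::vector<double>& fout,
              int nx, int ny, double omega) {
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                double f[Q], rho = 0.0, ux = 0.0, uy = 0.0;
                for (int q = 0; q < Q; ++q) {        // pull from neighbours
                    int xs = (x - cx[q] + nx) % nx;  // periodic wrap
                    int ys = (y - cy[q] + ny) % ny;
                    f[q] = fin[(ys * nx + xs) * Q + q];
                    rho += f[q];
                    ux += f[q] * cx[q];
                    uy += f[q] * cy[q];
                }
                ux /= rho; uy /= rho;
                double u2 = ux * ux + uy * uy;
                for (int q = 0; q < Q; ++q) {        // BGK relaxation
                    double cu = cx[q] * ux + cy[q] * uy;
                    double feq = w[q] * rho *
                                 (1.0 + 3.0*cu + 4.5*cu*cu - 1.5*u2);
                    fout[(y * nx + x) * Q + q] = f[q] + omega * (feq - f[q]);
                }
            }
    }

Because each output site depends only on the read-only input lattice, the two loops over x and y can be parallelized without any synchronization.
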
Apr, 23

GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs

Today’s graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Thus, data compression is becoming a critical technique to mitigate the storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from […]
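
As a reference point, greedy byte-granularity LZSS fits in a few dozen lines: scan a sliding window for the longest match and emit either a literal or an (offset, length) token. The sketch below is a minimal CPU version, not GPULZ's multi-byte GPU pipeline; the token format, window size, and length limits are arbitrary choices.

    // Minimal greedy LZSS encoder (CPU reference, byte granularity).
    // Token stream: flag 0 -> literal byte; flag 1 -> (offset, length).
    #include <cstdint>
    #include <vector>

    std::vector<uint8_t> lzss_encode(const uint8_t* in, size_t n,
                                     size_t window = 4096, size_t min_len = 3,
                                     size_t max_len = 18) {
        std::vector<uint8_t> out;
        size_t i = 0;
        while (i < n) {
            size_t best_len = 0, best_off = 0;
            size_t start = i > window ? i - window : 0;
            for (size_t j = start; j < i; ++j) {     // search the window
                size_t len = 0;
                while (len < max_len && i + len < n && in[j + len] == in[i + len])
                    ++len;
                if (len > best_len) { best_len = len; best_off = i - j; }
            }
            if (best_len >= min_len) {               // emit a match token
                out.push_back(1);
                out.push_back(uint8_t(best_off >> 8));
                out.push_back(uint8_t(best_off & 0xFF));
                out.push_back(uint8_t(best_len));
                i += best_len;
            } else {                                 // emit a literal
                out.push_back(0);
                out.push_back(in[i++]);
            }
        }
        return out;
    }
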
Apr, 23

Optimizing High-Performance Linpack for Exascale Accelerated Architectures

We detail the performance optimizations made in rocHPL, AMD’s open-source implementation of the High-Performance Linpack (HPL) benchmark targeting accelerated node architectures designed for exascale systems such as the Frontier supercomputer. The implementation leverages the high-throughput GPU accelerators on the node via highly optimized linear algebra libraries, as well as the entire CPU socket to perform […]
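
HPL ultimately times the solution of a large dense system via LU factorization with partial pivoting. The unblocked reference sketch below only fixes the underlying math; rocHPL's actual implementation blocks the factorization and maps the trailing update onto GPU BLAS, which this sketch does not attempt.

    // Reference (unblocked) LU factorization with partial pivoting:
    // the computation the HPL benchmark measures.
    #include <cmath>
    #include <utility>
    #include <vector>

    // In-place LU of a column-major n x n matrix A; piv records row swaps.
    void lu_factor(std::vector<double>& A, std::vector<int>& piv, int n) {
        auto a = [&](int r, int c) -> double& { return A[c * n + r]; };
        piv.resize(n);
        for (int k = 0; k < n; ++k) {
            int p = k;                               // find the pivot row
            for (int r = k + 1; r < n; ++r)
                if (std::fabs(a(r, k)) > std::fabs(a(p, k))) p = r;
            piv[k] = p;
            if (p != k)
                for (int c = 0; c < n; ++c) std::swap(a(k, c), a(p, c));
            for (int r = k + 1; r < n; ++r) {        // scale column k
                a(r, k) /= a(k, k);
                for (int c = k + 1; c < n; ++c)      // rank-1 trailing update
                    a(r, c) -= a(r, k) * a(k, c);
            }
        }
    }

In blocked form, the trailing update becomes a large matrix-matrix multiply, which is the part that GPU linear algebra libraries accelerate.
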
Apr, 16

Understanding Performance Portability of Bioinformatics Applications in SYCL on an NVIDIA GPU

Our goal is to have a better understanding of performance portability of SYCL kernels on a GPU. Toward this goal, we migrate representative kernels in bioinformatics applications from CUDA to SYCL, evaluate their performance on an NVIDIA GPU, and explain the performance gaps through performance profiling and analyses. We hope that the findings provide valuable […]
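
To make the CUDA-to-SYCL migration concrete, the sketch below pairs a trivial CUDA kernel (shown in comments) with its SYCL 2020 nd_range counterpart. It is not one of the paper's bioinformatics kernels, and the 256-thread work-group size is an arbitrary choice.

    // CUDA original, for comparison:
    //   __global__ void scale(float* d, float s, int n) {
    //     int i = blockIdx.x * blockDim.x + threadIdx.x;
    //     if (i < n) d[i] *= s;
    //   }
    #include <sycl/sycl.hpp>

    int main() {
        constexpr int n = 1 << 20;
        sycl::queue q{sycl::gpu_selector_v};
        float* d = sycl::malloc_shared<float>(n, q);
        for (int i = 0; i < n; ++i) d[i] = 1.0f;

        // nd_range maps to the CUDA grid/block decomposition.
        q.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{256}},
                       [=](sycl::nd_item<1> it) {
            int i = it.get_global_id(0);   // blockIdx*blockDim + threadIdx
            d[i] *= 2.0f;
        }).wait();

        sycl::free(d, q);
        return 0;
    }
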
Apr, 16

Kernel Tuning Toolkit

Kernel Tuning Toolkit (KTT) is an autotuning framework for CUDA, OpenCL and Vulkan kernels. KTT provides advanced autotuning features such as support for both dynamic (online) and offline tuning, and the ability to tune multiple kernels together with shared tuning parameters. Furthermore, it offers customization features that make integration into larger software suites possible. The […]
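
Independent of KTT's own API, the essence of offline tuning is a timed sweep over a parameter space. The hand-rolled loop below sketches that idea over a single hypothetical work-group-size parameter; it deliberately does not use KTT, whose interface is richer than this.

    // Generic offline-autotuning loop (not KTT's API): time each
    // candidate configuration and keep the fastest.
    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <vector>

    double time_run(const std::function<void(int)>& kernel, int param) {
        auto t0 = std::chrono::steady_clock::now();
        kernel(param);
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        std::vector<int> work_group_sizes = {32, 64, 128, 256, 512};
        auto kernel = [](int wg) {           // stand-in for a real GPU launch
            volatile long acc = 0;
            for (long i = 0; i < 1000000L / wg; ++i) acc += i;
        };
        int best = 0; double best_t = 1e30;
        for (int wg : work_group_sizes) {    // exhaustive offline sweep
            double t = time_run(kernel, wg);
            if (t < best_t) { best_t = t; best = wg; }
        }
        std::printf("best work-group size: %d (%.6fs)\n", best, best_t);
        return 0;
    }
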
Apr, 16

ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels

GPU-based HPC clusters are attracting more scientific application developers due to their extensive parallelism and energy efficiency. In order to achieve portability among a variety of multi-/many-core architectures, a popular choice for an application developer is to utilize directive-based parallel programming models, such as OpenMP. However, even with OpenMP, the developer must choose from […]
Apr, 16

Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators

Over the last decade, most of the increase in computing power has been gained by advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performance on various computing tasks, their utilization requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, introduced […]
Apr, 16

Energy-Efficient GPU Clusters Scheduling for Deep Learning

Training deep neural networks (DNNs) is a major workload in data centers today, resulting in tremendously fast growth of energy consumption. It is important to reduce this energy consumption while still completing DNN training jobs early. In this paper, we propose PowerFlow, a GPU cluster scheduler that reduces the average Job Completion […]
Apr, 2

ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales

As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints have become critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application runtime […]
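
One way to picture the runtime/energy tradeoff the abstract refers to is a Pareto front over measured configurations. The sketch below filters hypothetical (runtime, energy) samples down to the non-dominated ones; it is purely illustrative, and ytopt's own search is model-driven rather than an exhaustive sweep like this.

    // Keep only Pareto-optimal (runtime, energy) configurations:
    // sort by runtime, then sweep keeping strictly improving energy.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Sample { int config; double runtime_s, energy_j; };

    std::vector<Sample> pareto_front(std::vector<Sample> s) {
        std::sort(s.begin(), s.end(), [](const Sample& a, const Sample& b) {
            return a.runtime_s < b.runtime_s;
        });
        std::vector<Sample> front;
        double best_energy = 1e300;
        for (const Sample& x : s)
            if (x.energy_j < best_energy) {
                front.push_back(x);
                best_energy = x.energy_j;
            }
        return front;
    }

    int main() {
        // Hypothetical measurements for four configurations.
        std::vector<Sample> runs = {{0, 1.8, 410}, {1, 2.4, 300},
                                    {2, 1.9, 520}, {3, 3.0, 290}};
        for (const Sample& x : pareto_front(runs))
            std::printf("config %d: %.1fs, %.0fJ\n",
                        x.config, x.runtime_s, x.energy_j);
        return 0;
    }
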
Apr, 2

Task parallelism-based architectures on FPGA to optimize the energy efficiency of AI at the edge

In artificial intelligence (AI) at the edge, the primary concern is the energy efficiency of deep neural network (DNN) applications. In many applications the speed of obtaining an inference can be critical, but many easily meet their time requirements, and the energy needed to calculate the […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
