
Posts

Apr, 23

Simple and efficient GPU accelerated topology optimisation: Codes and applications

This work presents topology optimisation implementations for linear elastic compliance minimisation in three dimensions, accelerated using Graphics Processing Units (GPUs). Three different open-source implementations are presented for linear problems, two of which are GPU-accelerated, using either OpenMP 4.5 or the Futhark language for the hardware acceleration. Both GPU implementations are based on high […]
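
For a flavour of the OpenMP 4.5 offload approach the abstract mentions, an element-wise density-update loop of the kind that appears in topology optimisation can be pushed to the GPU with a single target directive. The sketch below is illustrative only, assuming an optimality-criteria-style update; function and variable names are not taken from the released codes.

    // Minimal sketch of OpenMP 4.5 target offload (illustrative, not the
    // paper's code): an element-wise optimality-criteria density update.
    #include <cmath>
    #include <vector>

    void update_densities(std::vector<double>& x, const std::vector<double>& dc,
                          double lambda, double move) {
        double* xp = x.data();
        const double* dcp = dc.data();
        const std::size_t n = x.size();
        #pragma omp target teams distribute parallel for \
            map(tofrom: xp[0:n]) map(to: dcp[0:n])
        for (std::size_t e = 0; e < n; ++e) {
            // OC update rule; compliance sensitivities dc are negative.
            double step = xp[e] * std::sqrt(-dcp[e] / lambda);
            double lo = xp[e] - move, hi = xp[e] + move;   // move limits
            step = step < lo ? lo : (step > hi ? hi : step);
            xp[e] = step < 0.0 ? 0.0 : (step > 1.0 ? 1.0 : step); // box [0,1]
        }
    }
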
Apr, 23

Fuzzing Loop Optimizations in Compilers for C++ and Data-Parallel Languages

Compilers are part of the foundation upon which software systems are built; they need to be as correct as possible. This paper is about stress-testing loop optimizers; it presents a major reimplementation of Yet Another Random Program Generator (YARPGen), an open-source generative compiler fuzzer. This new version has found 122 bugs, both in compilers for […]
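
The core idea of generative compiler fuzzing can be condensed into a toy differential-testing loop: emit a random, UB-free, self-checking loop program, build it at two optimization levels, and flag any divergence. The harness below is a minimal sketch, not YARPGen; the compiler command (cc), flags, and file names are assumptions.

    // Toy differential loop fuzzing (not YARPGen itself): any output
    // mismatch between -O0 and -O3 points at a miscompile.
    #include <cstdio>
    #include <cstdlib>
    #include <random>

    int main() {
        std::mt19937 rng(std::random_device{}());
        std::uniform_int_distribution<int> bound(1, 64), stride(1, 4), coef(1, 9);
        int n = bound(rng), s = stride(rng), c = coef(rng);

        // Emit a random reduction loop with a checksum as its only output.
        FILE* f = std::fopen("case.c", "w");
        std::fprintf(f,
            "#include <stdio.h>\n"
            "int main(void){ unsigned long acc = 0;\n"
            "  for (int i = 0; i < %d; i += %d) acc += %du * (unsigned)i;\n"
            "  printf(\"%%lu\\n\", acc); return 0; }\n", n, s, c);
        std::fclose(f);

        std::system("cc -O0 case.c -o case_O0 && ./case_O0 > out0.txt");
        std::system("cc -O3 case.c -o case_O3 && ./case_O3 > out3.txt");
        return std::system("cmp -s out0.txt out3.txt") == 0 ? 0 : 1;
    }
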
Apr, 23

Thread-safe lattice Boltzmann for high-performance computing on GPUs

We present thread-safe, highly optimized lattice Boltzmann (LB) implementations, specifically aimed at exploiting the high memory bandwidth of GPU-based architectures. In contrast to standard approaches to LB coding, the proposed strategy, based on the reconstruction of the post-collision distribution via Hermite projection, enforces data locality and avoids the onset of memory dependencies, which may arise during the […]
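
For context, a standard way to obtain a race-free LB update is the two-lattice "pull" scheme, where each site reads its neighbours' populations and writes only its own, so every site is independent. The D2Q9 BGK sketch below illustrates that baseline only; it is not the paper's Hermite-projection reconstruction, and the data layout is an arbitrary choice.

    // Generic two-lattice "pull" collide-and-stream step for D2Q9 BGK.
    #include <vector>

    constexpr int Q = 9;
    constexpr int cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    constexpr int cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
    constexpr double w[Q] = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                             1.0/36, 1.0/36, 1.0/36, 1.0/36};

    void step(const std::vector<double>& fin, std::vector<double>& fout,
              int nx, int ny, double omega) {
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                double f[Q], rho = 0.0, ux = 0.0, uy = 0.0;
                for (int q = 0; q < Q; ++q) {        // pull from neighbours
                    int xs = (x - cx[q] + nx) % nx;  // periodic wrap
                    int ys = (y - cy[q] + ny) % ny;
                    f[q] = fin[(ys * nx + xs) * Q + q];
                    rho += f[q];
                    ux += f[q] * cx[q];
                    uy += f[q] * cy[q];
                }
                ux /= rho; uy /= rho;
                double u2 = ux * ux + uy * uy;
                for (int q = 0; q < Q; ++q) {        // BGK relaxation
                    double cu = cx[q] * ux + cy[q] * uy;
                    double feq = w[q] * rho *
                                 (1.0 + 3.0*cu + 4.5*cu*cu - 1.5*u2);
                    fout[(y * nx + x) * Q + q] = f[q] + omega * (feq - f[q]);
                }
            }
    }

Because each output site depends only on the read-only input lattice, the two loops over x and y can be parallelized without any synchronization.
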
Apr, 23

GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs

Today’s graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Thus, data compression is becoming a critical technique to mitigate the storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from […]
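
As a reference point, greedy byte-granularity LZSS fits in a few dozen lines: scan a sliding window for the longest match and emit either a literal or an (offset, length) token. The sketch below is a minimal CPU version, not GPULZ's multi-byte GPU pipeline; the token format, window size, and length limits are arbitrary choices.

    // Minimal greedy LZSS encoder (CPU reference, byte granularity).
    // Token stream: flag 0 -> literal byte; flag 1 -> (offset, length).
    #include <cstdint>
    #include <vector>

    std::vector<uint8_t> lzss_encode(const uint8_t* in, size_t n,
                                     size_t window = 4096, size_t min_len = 3,
                                     size_t max_len = 18) {
        std::vector<uint8_t> out;
        size_t i = 0;
        while (i < n) {
            size_t best_len = 0, best_off = 0;
            size_t start = i > window ? i - window : 0;
            for (size_t j = start; j < i; ++j) {     // search the window
                size_t len = 0;
                while (len < max_len && i + len < n && in[j + len] == in[i + len])
                    ++len;
                if (len > best_len) { best_len = len; best_off = i - j; }
            }
            if (best_len >= min_len) {               // emit a match token
                out.push_back(1);
                out.push_back(uint8_t(best_off >> 8));
                out.push_back(uint8_t(best_off & 0xFF));
                out.push_back(uint8_t(best_len));
                i += best_len;
            } else {                                 // emit a literal
                out.push_back(0);
                out.push_back(in[i++]);
            }
        }
        return out;
    }
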
Apr, 23

Optimizing High-Performance Linpack for Exascale Accelerated Architectures

We detail the performance optimizations made in rocHPL, AMD’s open-source implementation of the High-Performance Linpack (HPL) benchmark targeting accelerated node architectures designed for exascale systems such as the Frontier supercomputer. The implementation leverages the high-throughput GPU accelerators on the node via highly optimized linear algebra libraries, as well as the entire CPU socket to perform […]
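
HPL ultimately times the solution of a large dense system via LU factorization with partial pivoting. The unblocked reference sketch below only fixes the underlying math; rocHPL's actual implementation blocks the factorization and maps the trailing update onto GPU BLAS, which this sketch does not attempt.

    // Reference (unblocked) LU factorization with partial pivoting:
    // the computation the HPL benchmark measures.
    #include <cmath>
    #include <utility>
    #include <vector>

    // In-place LU of a column-major n x n matrix A; piv records row swaps.
    void lu_factor(std::vector<double>& A, std::vector<int>& piv, int n) {
        auto a = [&](int r, int c) -> double& { return A[c * n + r]; };
        piv.resize(n);
        for (int k = 0; k < n; ++k) {
            int p = k;                               // find the pivot row
            for (int r = k + 1; r < n; ++r)
                if (std::fabs(a(r, k)) > std::fabs(a(p, k))) p = r;
            piv[k] = p;
            if (p != k)
                for (int c = 0; c < n; ++c) std::swap(a(k, c), a(p, c));
            for (int r = k + 1; r < n; ++r) {        // scale column k
                a(r, k) /= a(k, k);
                for (int c = k + 1; c < n; ++c)      // rank-1 trailing update
                    a(r, c) -= a(r, k) * a(k, c);
            }
        }
    }

In blocked form, the trailing update becomes a large matrix-matrix multiply, which is the part that GPU linear algebra libraries accelerate.
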
Apr, 16

Understanding Performance Portability of Bioinformatics Applications in SYCL on an NVIDIA GPU

Our goal is to have a better understanding of performance portability of SYCL kernels on a GPU. Toward this goal, we migrate representative kernels in bioinformatics applications from CUDA to SYCL, evaluate their performance on an NVIDIA GPU, and explain the performance gaps through performance profiling and analyses. We hope that the findings provide valuable […]
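
To make the CUDA-to-SYCL migration concrete, the sketch below pairs a trivial CUDA kernel (shown in comments) with its SYCL 2020 nd_range counterpart. It is not one of the paper's bioinformatics kernels, and the 256-thread work-group size is an arbitrary choice.

    // CUDA original, for comparison:
    //   __global__ void scale(float* d, float s, int n) {
    //     int i = blockIdx.x * blockDim.x + threadIdx.x;
    //     if (i < n) d[i] *= s;
    //   }
    #include <sycl/sycl.hpp>

    int main() {
        constexpr int n = 1 << 20;
        sycl::queue q{sycl::gpu_selector_v};
        float* d = sycl::malloc_shared<float>(n, q);
        for (int i = 0; i < n; ++i) d[i] = 1.0f;

        // nd_range maps to the CUDA grid/block decomposition.
        q.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{256}},
                       [=](sycl::nd_item<1> it) {
            int i = it.get_global_id(0);   // blockIdx*blockDim + threadIdx
            d[i] *= 2.0f;
        }).wait();

        sycl::free(d, q);
        return 0;
    }
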
Apr, 16

Kernel Tuning Toolkit

Kernel Tuning Toolkit (KTT) is an autotuning framework for CUDA, OpenCL and Vulkan kernels. KTT provides advanced autotuning features such as support for both dynamic (online) and offline tuning, and the ability to tune multiple kernels together with shared tuning parameters. Furthermore, it offers customization features that make integration into larger software suites possible. The […]
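
Independent of KTT's own API, the essence of offline tuning is a timed sweep over a parameter space. The hand-rolled loop below sketches that idea over a single hypothetical work-group-size parameter; it deliberately does not use KTT, whose interface is richer than this.

    // Generic offline-autotuning loop (not KTT's API): time each
    // candidate configuration and keep the fastest.
    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <vector>

    double time_run(const std::function<void(int)>& kernel, int param) {
        auto t0 = std::chrono::steady_clock::now();
        kernel(param);
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        std::vector<int> work_group_sizes = {32, 64, 128, 256, 512};
        auto kernel = [](int wg) {           // stand-in for a real GPU launch
            volatile long acc = 0;
            for (long i = 0; i < 1000000L / wg; ++i) acc += i;
        };
        int best = 0; double best_t = 1e30;
        for (int wg : work_group_sizes) {    // exhaustive offline sweep
            double t = time_run(kernel, wg);
            if (t < best_t) { best_t = t; best = wg; }
        }
        std::printf("best work-group size: %d (%.6fs)\n", best, best_t);
        return 0;
    }
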
Apr, 16

ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels

GPU-based HPC clusters are attracting more scientific application developers due to their extensive parallelism and energy efficiency. In order to achieve portability among a variety of multi-/many-core architectures, a popular choice for an application developer is to utilize directive-based parallel programming models, such as OpenMP. However, even with OpenMP, the developer must choose from […]
Apr, 16

Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators

Over the last decade, most of the increase in computing power has been gained by advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performance on various computing tasks, their utilization requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, introduced […]
Apr, 16

Energy-Efficient GPU Clusters Scheduling for Deep Learning

Training deep neural networks (DNNs) is a major workload in data centers today, resulting in tremendously fast growth of energy consumption. It is important to reduce this energy consumption while still completing DNN training jobs early. In this paper, we propose PowerFlow, a GPU cluster scheduler that reduces the average Job Completion […]
Apr, 2

ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales

As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints have become critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application runtime […]
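
One way to picture the runtime/energy tradeoff the abstract refers to is a Pareto front over measured configurations. The sketch below filters hypothetical (runtime, energy) samples down to the non-dominated ones; it is purely illustrative, and ytopt's own search is model-driven rather than an exhaustive sweep like this.

    // Keep only Pareto-optimal (runtime, energy) configurations:
    // sort by runtime, then sweep keeping strictly improving energy.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Sample { int config; double runtime_s, energy_j; };

    std::vector<Sample> pareto_front(std::vector<Sample> s) {
        std::sort(s.begin(), s.end(), [](const Sample& a, const Sample& b) {
            return a.runtime_s < b.runtime_s;
        });
        std::vector<Sample> front;
        double best_energy = 1e300;
        for (const Sample& x : s)
            if (x.energy_j < best_energy) {
                front.push_back(x);
                best_energy = x.energy_j;
            }
        return front;
    }

    int main() {
        // Hypothetical measurements for four configurations.
        std::vector<Sample> runs = {{0, 1.8, 410}, {1, 2.4, 300},
                                    {2, 1.9, 520}, {3, 3.0, 290}};
        for (const Sample& x : pareto_front(runs))
            std::printf("config %d: %.1fs, %.0fJ\n",
                        x.config, x.runtime_s, x.energy_j);
        return 0;
    }
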
Apr, 2

Task parallelism-based architectures on FPGA to optimize the energy efficiency of AI at the edge

In artificial intelligence (AI) at the edge, the primary concern is the energy efficiency of deep neural network (DNN) applications. In many applications the speed of obtaining an inference can be critical, but many easily meet their time requirements, and the energy needed to calculate the […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
