Posts

Apr, 2

Managing heterogeneous device memory using C++17 memory resources

Programmers using the C++ programming language are increasingly taught to manage memory implicitly through containers provided by the C++ standard library. However, heterogeneous programming platforms often require explicit allocation and deallocation of memory. This discrepancy in memory management strategies can be daunting and problematic for C++ developers who are not already familiar with heterogeneous programming. […]
Apr, 2

PopSparse: Accelerated block sparse matrix multiplication on IPU

Reducing the computational cost of running large scale neural networks using sparsity has attracted great attention in the deep learning community. While much success has been achieved in reducing FLOP and parameter counts while maintaining acceptable task performance, achieving actual speed improvements has typically been much more difficult, particularly on general purpose accelerators (GPAs) such […]
Apr, 2

Pgx: Hardware-accelerated parallel game simulation for reinforcement learning

We propose Pgx, a collection of board game simulators written in JAX. Thanks to the auto-vectorization and just-in-time compilation of JAX, Pgx scales easily to thousands of parallel executions on GPU/TPU accelerators. We found that the simulation of Pgx on a single A100 GPU is 10x faster than that of existing reinforcement learning libraries. Pgx implements […]
Mar, 26

Comparing SYCL data transfer strategies for tracking use cases

The aim of this work is to compare the performance and ease of programming of the various data transfer strategies provided by SYCL 2020: buffers/accessors on one hand and the different storage types exposed by Unified Shared Memory (USM) on the other hand. We measured the relative performance of USM exclusively located either on the […]
Mar, 26

E2C: A Visual Simulator to Reinforce Education of Heterogeneous Computing Systems

With the increasing popularity of accelerator technologies (e.g., GPUs and TPUs) and the emergence of domain-specific computing via ASICs and FPGAs, the matter of heterogeneity and understanding its ramifications on performance has become more critical than ever before. However, it is challenging to effectively educate students about the potential impacts of heterogeneity on the […]
Mar, 26

Reinforcement Learning Strategies for Compiler Optimization in High level Synthesis

High Level Synthesis (HLS) offers a possible programmability solution for FPGAs by automatically compiling CPU codes to custom hardware configurations, but currently delivers far lower hardware quality than circuits written using Hardware Description Languages (HDLs). One reason is that the standard set of code optimizations used by CPU compilers, such as LLVM, are not well […]
Mar, 26

Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications

Graphics Processing Units (GPUs) have become ubiquitous in scientific computing. However, writing efficient GPU kernels can be challenging due to the need for careful code tuning. To automatically explore the kernel optimization space, several auto-tuning tools, such as Kernel Tuner, have been proposed. Unfortunately, these existing auto-tuning tools often do not concern themselves with […]
Mar, 26

DSDP: A Blind Docking Strategy Accelerated by GPUs

Virtual screening, including molecular docking, plays an essential role in drug discovery. Many traditional and machine-learning based methods are available to fulfil the docking task. The traditional docking methods are normally extensively time-consuming, and their performance in blind docking remains to be improved. Although the runtime of docking based on machine learning is significantly decreased, […]
Mar, 19

Challenges and Opportunities in C/C++ Source-To-Source Compilation

The C/C++ compilation stack (Intermediate Representations (IRs), compilation passes and backends) is encumbered by a steep learning curve, which we believe can be lowered by complementing it with approaches such as source-to-source compilation. Source-to-source compilation is a technology that is widely used and quite mature in certain programming environments, such as JavaScript, but that faces […]
Mar, 19

Stellar Mergers with HPX-Kokkos and SYCL: Methods of using an Asynchronous Many-Task Runtime System with SYCL

Ranging from NVIDIA GPUs to AMD GPUs and Intel GPUs: Given the heterogeneity of available accelerator cards within current supercomputers, portability is a key aspect for modern HPC applications. In Octo-Tiger, we rely on Kokkos and its various execution spaces for portable compute kernels. In turn, we use HPX to coordinate kernel launches, CPU tasks, […]
Mar, 19

Statistical Computing With Graphics Processing Units

This thesis consists of two main projects and a third project which is provided in the appendix. The contribution of the first project is a tool set for parallel random number generation on GPUs in R, namely, the clrng package. This package is currently the only R package that provides facilities for generating random […]
Mar, 19

Towards a Benchmarking Suite for Kernel Tuners

As computing systems become more complex, it is becoming harder for programmers to keep their codes optimized as the hardware gets updated. Autotuners try to alleviate this by hiding as many architecture-based optimization details as possible from the user, so that the code can be used efficiently across different generations of systems. In this article […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
