high performance computing on graphics processing units: hgpu.org

Posts

Jun, 27

Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems

A challenge that heterogeneous system programmers face is leveraging the performance of all the devices that integrate the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of heterogeneous systems. Sigmoid splits the workload proportionally to the capabilities of the devices, drastically […]

OpenCL

Jun, 20

Study and evaluation of improved automatic GPU offloading method

With the slowing down of Moore’s law, the use of hardware other than CPUs, such as graphics processing units (GPUs) or field-Programmable gate arrays (FPGAs), is increasing. However, when using heterogeneous hardware other than CPUs, barriers to technical skills, such for compute unified device architecture (CUDA) and open computing language (OpenCL), are high. Therefore, I […]

CUDA

•

OpenCL

May, 23

Automatically Exploiting the Memory Hierarchy of GPUs through Just-in-Time Compilation

Although Graphics Processing Units (GPUs) have become pervasive for data-parallel workloads, the efficient exploitation of their tiered memory hierarchy requires explicit programming. The efficient utilization of different GPU memory tiers can yield higher performance at the expense of programmability since developers must have extended knowledge of the architectural details in order to utilize them. In […]

OpenCL

May, 16

Raster Time Series: Learning and Processing

As the amount of remote sensing data is increasing at a high rate, due to great improvements in sensor technology, efficient processing capabilities are of utmost importance. Remote sensing data from satellites is crucial in many scientific domains, like biodiversity and climate research. Because weather and climate are of particular interest for almost all living […]

OpenCL

May, 9

Sylkan: Towards a Vulkan Compute Target Platform for SYCL

SYCL is a modern high-level C++ programming interface which excels at expressing data parallelism for heterogeneous hardware platforms in a programmer-friendly way, and is standardized by the Khronos Group. The latest version of the standard, SYCL 2020, removes the previous dependence of the specification and its implementations on an underlying OpenCL target, opening the door […]

OpenCL

May, 9

Irregularity Mitigation and Portability Abstractions for Accelerated Sparse Matrix Factorization

In this thesis, we investigate new ways to mitigate the inherent irregularity in sparse matrix factorizations and decompose the resulting computation into simple kernels which are portable across a diverse set of compute accelerator architectures through our novel compiler borG. Be it weather prediction, climate models, personalized medicine, genetic analysis and autonomous driving: some of […]

OpenCL

Apr, 18

SLATE port to AMD and Intel platforms

SLATE implements GPU-accelerated linear algebra, relying primarily on vendor-provided GPU BLAS for performance, in particular batched BLAS routines. Initially, SLATE was written using NVIDIA’s CUDA and cuBLAS for GPU acceleration. At the time that the SLATE project was started, it was unclear what GPU technologies would exist for other platforms [1]. Since then, AMD has […]

CUDA

•

OpenCL

Apr, 11

Multiple-Tasks on Multiple-Devices (MTMD): Exploiting Concurrency in Heterogeneous Managed Runtimes

Modern commodity devices are nowadays equipped with a plethora of heterogeneous devices serving different purposes. Being able to exploit such heterogeneous hardware accelerators to their full potential is of paramount importance in the pursuit of higher performance and energy efficiency. Towards these objectives, the reduction of idle time of each device as well as the […]

OpenCL

Apr, 5

Parallel Arbitrary-precision Integer Arithmetic

Arbitrary-precision integer arithmetic computations are driven by applications in solving systems of polynomial equations and public-key cryptography. Such computations arise when high precision is required (with large input values that fit into multiple machine words), or to avoid coefficient overflow due to intermediate expression swell. Meanwhile, the growing demand for faster computation alongside the recent […]

CUDA

•

OpenCL

Feb, 28

Accelerating AutoDock4 with GPUs and Gradient-Based Local Search

AutoDock4 is a widely used program for docking small molecules to macromolecular targets. It describes ligand–receptor interactions using a physics-inspired scoring function that has been proven useful in a variety of drug discovery projects. However, compared to more modern and recent software, AutoDock4 has longer execution times, limiting its applicability to large scale dockings. To […]

OpenCL

Feb, 21

Optimal program variant generation for hybrid manycore systems

Field Programmable Gate Arrays promise to deliver superior energy efficiency in heterogeneous high performance computing, as compared to multicore CPUs and GPUs. The rate of adoption is however hampered by the relative difficulty of programming FPGAs. High-level synthesis tools such as Xilinx Vivado, Altera OpenCL or Intel’s HLS address a large part of the programmability […]

OpenCL

Feb, 14

Optimization of Data Assignment for Parallel Processing in a Hybrid Heterogeneous Environment Using Integer Linear Programming

In the paper we investigate a practical approach to application of integer linear programming for optimization of data assignment to compute units in a multi-level heterogeneous environment with various compute devices, including CPUs, GPUs and Intel Xeon Phis. The model considers an application that processes a large number of data chunks in parallel on various […]

OpenCL

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems

Study and evaluation of improved automatic GPU offloading method

Automatically Exploiting the Memory Hierarchy of GPUs through Just-in-Time Compilation

Raster Time Series: Learning and Processing

Sylkan: Towards a Vulkan Compute Target Platform for SYCL

Irregularity Mitigation and Portability Abstractions for Accelerated Sparse Matrix Factorization

SLATE port to AMD and Intel platforms

Multiple-Tasks on Multiple-Devices (MTMD): Exploiting Concurrency in Heterogeneous Managed Runtimes

Parallel Arbitrary-precision Integer Arithmetic

Accelerating AutoDock4 with GPUs and Gradient-Based Local Search

Optimal program variant generation for hybrid manycore systems

Optimization of Data Assignment for Parallel Processing in a Hybrid Heterogeneous Environment Using Integer Linear Programming

Recent source codes

Code examples for paper on SYCL backend of Kokkos - IWOCL 2024

ROCm's implementation of Gromacs

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Most viewed papers (last 30 days)