high performance computing on graphics processing units: hgpu.org

Posts

Oct, 15

Strega: An HTTP Server for FPGAs

The computer architecture landscape is being reshaped by the new opportunities, challenges and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to […]

OpenCL

Oct, 15

Comparison of different n-body algorithms on various hardware platforms using SYCL

The n-body problem has various applications in different fields of science such as astrophysics, where it describes the problem of calculating the movements of n different bodies which all interact with each other over time. There exist different algorithms that solve the n-body problem, for example, the naive approach and the Barnes-Hut algorithm. Since applications […]

Oct, 15

Open SYCL on heterogeneous GPU systems: A case of study

Computational platforms for high-performance scientific applications are becoming more heterogeneous, including hardware accelerators such as multiple GPUs. Applications in a wide variety of scientific fields require an efficient and careful management of the computational resources of this type of hardware to obtain the best possible performance. However, there are currently different GPU vendors, architectures and […]

CUDA

Oct, 8

Exploring the Limits of Generic Code Execution on GPUs via Direct (OpenMP) Offload

GPUs are well-known for their remarkable ability to accelerate computations through massive parallelism. However, offloading computations to GPUs necessitates manual identification of code regions that should be executed on the device, memory that needs to be transferred, and synchronization to be handled. Recent work has leveraged the portable target offloading interface provided by LLVM/OpenMP, taking […]

Oct, 8

Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL library

As supercomputers become larger with powerful Graphics Processing Units (GPUs), traditional direct eigensolvers struggle to keep up with the hardware evolution and scale efficiently due to communication and synchronization demands. Conversely, subspace eigensolvers, like the Chebyshev Accelerated Subspace Eigensolver (ChASE), have a simpler structure and can overcome communication and synchronization bottlenecks. ChASE is a modern […]

CUDA

Oct, 8

Automated Buffer Sizing of Dataflow Applications in a High-Level Synthesis Workflow

High-Level Synthesis (HLS) tools are mature enough to provide efficient code generation for computation kernels on FPGA hardware. For more complex applications, multiple kernels may be connected by a dataflow graph. Although some tools, such as Xilinx Vitis HLS, support dataflow directives, they lack efficient analysis methods to compute the buffer sizes between kernels in […]

OpenCL

Oct, 8

Impacts of Parallel Programming on Limited-Resource Hardware

Limited resource hardware devices are more affordable and energy efficient than high-end hardware. Despite their reduced size, these devices are increasingly complex, with many now featuring multiple processing cores, GPGPU accelerators, and larger RAM capacity. To fully utilize their computational capacity, software developers must exploit parallelism, but this adds an extra layer of complexity because […]

CUDA • OpenCL

Oct, 8

Fortran performance optimisation and auto-parallelisation by leveraging MLIR-based domain specific abstractions in Flang

MLIR has become popular since it was open sourced in 2019. A sub-project of LLVM, the flexibility provided by MLIR to represent Intermediate Representations (IR) as dialects at different abstraction levels, to mix these, and to leverage transformations between dialects provides opportunities for automated program optimisation and parallelisation. In addition to general purpose compilers built […]

Oct, 1

Memory Efficient Mixed-Precision Optimizers

Traditional optimization methods rely on the use of single-precision floating point arithmetic, which can be costly in terms of memory size and computing power. However, mixed precision optimization techniques leverage the use of both single and half-precision floating point arithmetic to reduce memory requirements while maintaining model accuracy. We provide here an algorithm to further […]

CUDA

Oct, 1

Experience Migrating OpenCL to SYCL: A Case Study on Searches for Potential Off-Target Sites of Cas9 RNA-Guided Endonucleases on AMD GPUs

Cas-OFFinder is a popular application written in OpenCL for searching potential off-target sites in parallel on a GPU. In this work, we describe our experience of migrating the application from OpenCL to SYCL. Evaluating the performance of the OpenCL and SYCL application using human genome sequences shows that the SYCL program could achieve performance portability […]

OpenCL

Oct, 1

OpenMP Kernel Language Extensions for Performance Portable GPU Codes

In contemporary high-performance computing architectures, the integration of GPU accelerators has become increasingly prevalent. To harness the full potential of these accelerators, developers often resort to vendor-specific kernel languages, such as CUDA. While this approach ensures optimal efficiency, it inherently compromises portability and engenders vendor dependency. Existing portable programming models, such as OpenMP, while promising, […]

CUDA

Oct, 1

Beehive SPIR-V Toolkit: A Composable and Functional API for Runtime SPIR-V Code Generation

The Standard Portable Intermediate Representation (SPIR-V) is a low-level binary format designed for representing shaders and compute kernels that can be consumed by OpenCL for computing kernels, and Vulkan for graphics rendering. As a binary representation, SPIR-V is meant to be used by compilers and runtime systems, and is usually performed by C/C++ programs and […]

OpenCL

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)