high performance computing on graphics processing units: hgpu.org

Posts

Jan, 28

A Heterogeneous Inference Framework for a Deep Neural Network

Artificial intelligence (AI) is one of the most promising technologies based on machine learning algorithms. In this paper, we propose a workflow for the implementation of deep neural networks. This workflow attempts to combine the flexibility of high-level compilers (HLS)-based networks with the architectural control features of hardware description languages (HDL)-based flows. The architecture consists […]

OpenCL

Jan, 14

Preliminary report: Initial evaluation of StdPar implementations on AMD GPUs for HPC

Recently, AMD platforms have not supported offloading C++17 PSTL (StdPar) programs to the GPU. Our previous work highlights how StdPar is able to achieve good performance across NVIDIA and Intel GPU platforms. In that work, we acknowledged AMD’s past effort such as HCC, which unfortunately is deprecated and does not support newer hardware platforms. Recent […]

OpenCL

Jan, 14

Code Generation for a Variety of Accelerators for a Graph DSL

Sparse graphs are ubiquitous in real and virtual worlds. With the phenomenal growth in semi-structured and unstructured data, sizes of the underlying graphs have witnessed a rapid growth over the years. Analyzing such large structures necessitates parallel processing, which is challenged by the intrinsic irregularity of sparse computation, memory access, and communication. It would be […]

CUDA

•

OpenCL

Dec, 31

Gaiwan: a Size-Polymorphic Typesystem for GPU Programs

General-purpose computing on graphics processing units (GPGPU) is increasingly used for number crunching tasks such as analyzing time series data. GPUs are a good fit for these tasks as they can execute many computations in parallel. To leverage this parallelism, the programmer is forced to carefully divide their input data into data blocks that are […]

OpenCL

Nov, 19

GPU Auto-tuning Framework for Optimal Performance and Power Consumption

An auto-tuning framework for GPU devices is presented for tuning application kernels of OpenCL. The GPU tuner employs multi-objective optimization methodology to improve the performance and power consumption of applications. It efficiently explores a user defined solution space comprising of possible tunable algorithmic and hardware counter variations through code transformations. The methodology targets GPU code […]

OpenCL

Nov, 12

On the Three P’s of Parallel Programming for Heterogeneous Computing: Performance, Productivity, and Portability

As FPGAs and GPUs continue to make inroads into high-performance computing (HPC), the need for languages and frameworks that offer performance, productivity, and portability across heterogeneous platforms, such as FPGAs and GPUs, continues to grow. OpenCL and SYCL have emerged as frameworks that offer cross-platform functional portability between FPGAs and GPUs. While functional portability across […]

OpenCL

Nov, 12

An Evaluation of Directive-Based Parallelization on the GPU Using a Parboil Benchmark

Heterogeneous architectures consisting of both central processing units and graphics processing units are common in contemporary computer systems. For that reason, several programming models have been developed to exploit available parallelism, such as low-level CUDA and OpenCL, and directive-based OpenMP and OpenACC. In this paper we explore and evaluate the applicability of OpenACC, which is […]

CUDA

•

OpenCL

Nov, 5

Applying the Midas Touch of Reproducibility to High-Performance Computing

With the serial performance of CPUs improving exponentially through the 1980s and 1990s and then plateauing by the mid-2000s, the high-performance computing community has seen parallel computing become ubiquitous, which, in turn, has led to a proliferation of parallel programming models. This diversity in hardware platform and programming model has forced programmers to port their […]

CUDA

•

OpenCL

Nov, 5

A Comparison of the Performance of the Molecular Dynamics Simulation Package GROMACS Implemented in the SYCL and CUDA Programming Models

For many years, systems running Nvidia-based GPU architectures have dominated the heterogeneous supercomputer landscape. However, recently GPU chipsets manufactured by Intel and AMD have cut into this market and can now be found in some of the world’s fastest supercomputers. The June 2023 edition of the TOP500 list of supercomputers ranks the Frontier supercomputer at […]

CUDA

•

OpenCL

Oct, 15

Strega: An HTTP Server for FPGAs

The computer architecture landscape is being reshaped by the new opportunities, challenges and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to […]

OpenCL

Oct, 1

Beehive SPIR-V Toolkit: A Composable and Functional API for Runtime SPIR-V Code Generation

The Standard Portable Intermediate Representation (SPIR-V) is a low-level binary format designed for representing shaders and compute kernels that can be consumed by OpenCL for computing kernels, and Vulkan for graphics rendering. As a binary representation, SPIR-V is meant to be used by compilers and runtime systems, and is usually performed by C/C++ programs and […]

OpenCL

Sep, 24

Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications

In this paper, we evaluate the portability of the SYCL programming model on some of the latest CPUs and GPUs from a wide range of vendors, utilizing the two main compilers: DPC++ and hipSYCL/OpenSYCL. Both compilers currently support GPUs from all three major vendors; we evaluate performance on the Intel(R) Data Center GPU Max 1100, […]

CUDA

•

OpenCL

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

94% on CIFAR-10 in 3.29 Seconds on a Single GPU

See all packages

* * *

high performance computing on graphics processing units: hgpu.org

Posts

A Heterogeneous Inference Framework for a Deep Neural Network

Preliminary report: Initial evaluation of StdPar implementations on AMD GPUs for HPC

Code Generation for a Variety of Accelerators for a Graph DSL

Gaiwan: a Size-Polymorphic Typesystem for GPU Programs

GPU Auto-tuning Framework for Optimal Performance and Power Consumption

On the Three P’s of Parallel Programming for Heterogeneous Computing: Performance, Productivity, and Portability

An Evaluation of Directive-Based Parallelization on the GPU Using a Parboil Benchmark

Applying the Midas Touch of Reproducibility to High-Performance Computing

A Comparison of the Performance of the Molecular Dynamics Simulation Package GROMACS Implemented in the SYCL and CUDA Programming Models

Strega: An HTTP Server for FPGAs

Beehive SPIR-V Toolkit: A Composable and Functional API for Runtime SPIR-V Code Generation

Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications

Recent source codes

CuPBoP-AMD: Extending CUDA to AMD Platforms

Adopter: Automated Deep Learning Optimization via DSL-based Source Code Transformation

ROCm's implementation of Gromacs

Code examples for paper on SYCL backend of Kokkos - IWOCL 2024

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

Most viewed papers (last 30 days)