high performance computing on graphics processing units: hgpu.org

Posts

Jul, 12

Parallel Implementations for Solving Shortest Path Problem using Bellman-Ford

In this paper, different parallel implementations of the Bellman-Ford algorithm on GPU using OpenCL are presented. These include two variants of Bellman-Ford for the single-source shortest path (SSSP) problem and Bellman-Ford for the all-pairs shortest path (APSP) problem. A comparative analysis of their performance on CPU and GPU is also discussed. Write-write consistency […]

OpenCL
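
For readers unfamiliar with the algorithm named in the post above, here is a minimal sequential sketch of Bellman-Ford for SSSP in Python. It is not the paper's OpenCL implementation; the function name `bellman_ford` and the `(u, v, weight)` edge-tuple layout are assumptions made for illustration. In a GPU version the inner loop over edges is the usual candidate for parallel work-items, though how the paper's kernels are organized is not stated in the excerpt.

```python
from math import inf


def bellman_ford(num_vertices, edges, source):
    """Return shortest distances from `source` to every vertex.

    `edges` is a list of (u, v, weight) tuples (hypothetical layout).
    Raises ValueError if a negative-weight cycle is reachable from `source`.
    """
    dist = [inf] * num_vertices
    dist[source] = 0

    # At most |V| - 1 rounds of relaxing every edge are needed.
    for _ in range(num_vertices - 1):
        changed = False
        for u, v, w in edges:
            # Relax edge (u, v): keep the shorter of the known paths.
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break  # distances converged early

    # One more pass: any further improvement means a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative-weight cycle reachable from source")
    return dist


if __name__ == "__main__":
    example_edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 1)]
    print(bellman_ford(4, example_edges, 0))
```

Running the APSP variant on top of this sketch would amount to calling the function once per source vertex, which is one plausible reading of "Bellman-Ford for APSP" but is an assumption here.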

Jul, 7

GiMMiK – Generating Bespoke Matrix Multiplication Kernels for Various Hardware Accelerators; Applications in High-Order Computational Fluid Dynamics

Matrix multiplication is a fundamental linear algebra routine ubiquitous in all areas of science and engineering. Highly optimised BLAS libraries (cuBLAS and clBLAS on GPUs) are the most popular choices for an implementation of the General Matrix Multiply (GEMM) in software. However, performance of library GEMM is poor for small matrix sizes. In this thesis […]

CUDA • OpenCL

Jul, 6

A Parallelized Implementation for H. 264 Real-time Encoding Scheme

In this paper, a high-speed video stream encoder for the H.264 digital video codec standard is accelerated with today's parallel processing architectures. Building on GPU parallel processing techniques, we used OpenCL-based GPU kernel programs and achieved a high level of CPU-GPU interoperability. In its design, our system makes the CPU perform all […]

OpenCL

Jul, 6

High-level Parallel Programming Support for Heterogeneous Systems

This master's thesis focuses on several high-level parallel programming models for heterogeneous systems, which have become increasingly popular in the field of high-performance computing. Heterogeneous systems are an inexpensive and effective way to achieve further performance improvements. A powerful combination of graphics processing units (GPUs) and central processing units (CPUs) is one of the most […]

CUDA • OpenCL

Jul, 4

Writing self-adaptive codes for heterogeneous systems

Heterogeneous systems are becoming increasingly common. Relatedly, the popularity of OpenCL is growing, as it provides a unified means to program a wide variety of devices, including GPUs and multicore CPUs. More recently, the Heterogeneous Programming Library (HPL) targets the same variety of systems as OpenCL, intending to improve their programmability. The main drawback of […]

OpenCL

Jul, 4

A second generation of DEFG: Declarative Framework for GPUs

DEFG is our declarative language and framework for the efficient generation of OpenCL GPU applications. Using our new DEFG implementation, run-time and lines-of-code comparisons are provided for three well-known algorithms: Sobel image filtering, breadth-first search and all-pairs shortest path. The DEFG declarative language and corresponding OpenCL kernels provide complete OpenCL applications. The lines-of-code comparison demonstrates […]

OpenCL

Jul, 4

Parallel Implementation of Travelling Salesman Problem using Ant Colony Optimization

In this paper we propose a parallel implementation of the Ant Colony Optimization (ACO) Ant System algorithm on GPU using OpenCL. We compare different ACO parameters that directly or indirectly affect the result. A comparison of speedup between the CPU and GPU implementations is done, with a speedup of 3.11x in CPU […]

OpenCL

Jun, 24

AES encryption on modern consumer architectures

Specialized cryptographic processors target professional applications and offer both low latency and high throughput at the expense of cost. At the consumer level, a modern SoC embodies several accelerators and vector extensions (e.g. SSE, AES-NI), having a high degree of programmability through multiple APIs (OpenMP, OpenCL, etc). This work explains how a modern x86 system […]

OpenCL

Jun, 23

Runtime Visualization of Application Progress and Monitoring of a GPU-enabled Parallel Environment

The paper presents the design, implementation and real-life uses of a visualization subsystem for a distributed framework that parallelizes workflow-based computations across clusters whose nodes feature both CPUs and GPUs. Firstly, the proposed system presents a graphical view of the infrastructure with clusters, nodes and compute devices along with parameters and runtime graphs […]

OpenCL

Jun, 19

Parallel track reconstruction in CMS using the cellular automaton approach

The Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) is a general-purpose particle detector and comprises the largest silicon-based tracking system built to date with 75 million individual readout channels. The precise reconstruction of particle tracks from this tremendous amount of input channels is a compute-intensive task. The foreseen LHC beam parameters […]

OpenCL

Jun, 17

On the Performance Portability of Structured Grid Codes on Many-Core Computer Architectures

With the advent of many-core computer architectures such as GPGPUs from NVIDIA and AMD, and more recently Intel’s Xeon Phi, ensuring performance portability of HPC codes is potentially becoming more complex. In this work we have focused on one important application area — structured grid codes — and investigated techniques for ensuring performance portability across […]

OpenCL

Jun, 17

HAM – Heterogenous Active Messages for Efficient Offloading on the Intel Xeon Phi

The applicability of accelerators is limited by the attainable speed-up for the offloaded computations and by the offloading overheads. While GPU programming models like CUDA and OpenCL only allow optimising the application code and its speed-up, the available low-level APIs for the Intel Xeon Phi provide an opportunity to address the overheads, too. This work […]


* * *

Recent source codes

CuPBoP-AMD: Extending CUDA to AMD Platforms

Adopter: Automated Deep Learning Optimization via DSL-based Source Code Transformation

ROCm's implementation of Gromacs

Code examples for paper on SYCL backend of Kokkos - IWOCL 2024

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds
