high performance computing on graphics processing units: hgpu.org

Posts

Mar, 8

Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations

Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and they are getting increasingly complex. The single processor era has been replaced with multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, […]

OpenCL

Mar, 5

Input Space Splitting for OpenCL

The performance of OpenCL programs suffers from memory and control flow divergence. Therefore, OpenCL compilers employ static analyses to identify non-divergent control flow and memory accesses in order to produce faster code. However, divergence is often input-dependent, hence can be observed for some, but not all inputs. In these cases, vectorizing compilers have to generate […]

OpenCL

Mar, 3

Hadoop Mapreduce OpenCL Plugin

Modern systems generate huge amounts of information in areas such as finance, telematics, healthcare, and IoT devices, to name a few. Modern computing frameworks like MapReduce need an ever-increasing amount of computing power to sort, arrange, and generate insights from the data. This project is an attempt to harness the power of heterogeneous […]

OpenCL

Feb, 23

Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL

OpenCL is a portable interface that can be used to program cluster nodes with heterogeneous compute devices. The OpenCL specification tightly binds its workflow abstraction, or "command queue", to a specific device for the entire program. For best performance, the user has to find the ideal queue-device mapping at command queue creation time, an effort […]

OpenCL

Feb, 23

VirtCL: a framework for OpenCL device abstraction and management

The interest in using multiple graphics processing units (GPUs) to accelerate applications has increased in recent years. However, the existing heterogeneous programming models (e.g., OpenCL) abstract details of GPU devices at the per-device level and require programmers to explicitly schedule their kernel tasks on a system equipped with multiple GPU devices. Unfortunately, multiple applications running […]

OpenCL

Feb, 19

Automatic and portable mapping of data parallel programs to OpenCL for GPU-based heterogeneous systems

General-purpose GPU-based systems are highly attractive as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler-based approach to automatically generate optimized OpenCL code from data-parallel OpenMP programs for GPUs. A key feature of our scheme is that […]

OpenCL

Feb, 8

Workload distribution and balancing in FPGAs and CPUs with OpenCL and TBB

In this paper we evaluate the performance and energy effectiveness of FPGA and CPU devices for a class of parallel computing applications in which the workload can be distributed in a way that enables simultaneous computing in addition to simple offloading. The FPGA device is programmed via OpenCL using the recent availability of commercial […]

OpenCL

Feb, 4

A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs

Recently, FPGA vendors such as Altera and Xilinx have released OpenCL SDKs for programming FPGAs. However, the architecture of an FPGA is significantly different from that of a CPU/GPU, for which OpenCL was originally designed. Tuning the OpenCL code for good performance on FPGAs is still an open problem, since the existing OpenCL tools and models designed […]

OpenCL

Jan, 29

GPU-Accelerated Recurrent Neural Networks: OpenCLLink and SymbolicC

The paper presents an application of OpenCLLink in Wolfram Mathematica to accelerate fully recurrent neural networks using a GPU. We also show the idea of automatically generating parts of the source code using SymbolicC.

OpenCL

Jan, 14

A Case for Work-stealing on FPGAs with OpenCL Atomics

We provide a case study of work-stealing, a popular method for run-time load balancing, on FPGAs. Following the Cederman-Tsigas implementation for GPUs, we synchronize work-items not with locks, mutexes or critical sections, but instead with the atomic operations provided by Altera’s OpenCL SDK. We evaluate work-stealing for FPGAs by synthesizing a K-means clustering algorithm on […]

OpenCL

Dec, 31

Study of basic vector operations on Intel Xeon Phi and NVIDIA Tesla using OpenCL

The present work is an analysis of the performance of the basic vector operations AXPY, DOT and SpMV using OpenCL. The code was tested on the NVIDIA Tesla S2050 GPU and Intel Xeon Phi 3120A coprocessor. Due to the nature of the AXPY function, only two versions were implemented, the routine to be executed by […]

OpenCL

Dec, 23

SqueezCL: Squeezing OpenCL Kernels for Approximate Computing on Contemporary GPUs

Approximate computing provides an opportunity for exploiting application characteristics to improve performance of computing systems. However, such opportunity must be balanced against generality of methods and quality guarantees that the system designer can provide to the application developer. Improved parallel processing in graphics processing units (GPUs) provides one such means for data-level parallel applications. We […]

OpenCL

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA
