high performance computing on graphics processing units: hgpu.org

Posts

Feb, 3

Efficient SIMD Vectorization for Hashing in OpenCL

Hashing is at the core of many efficient database operators such as hash-based joins and aggregations. Vectorization is a technique that uses Single Instruction Multiple Data (SIMD) instructions to process multiple data elements at once. Applying vectorization to hash tables results in promising speedups for build and probe operations. However, vectorization typically requires intrinsics – […]

OpenCL

Dec, 24

Pass a Pointer: Exploring Shared Virtual Memory Abstractions in OpenCL Tools for FPGAs

Heterogeneous CPU-FPGA systems are gaining momentum in the embedded systems sector and in the data center market. While the programming abstractions for implementing the data transfer between CPU and FPGA (and vice versa) that are available in today’s commercial programming tools are well-suited for certain types of applications, the CPU-FPGA communication for applications that share […]

OpenCL

Dec, 24

Extending OmpSs for OpenCL kernel co-execution in heterogeneous systems

Heterogeneous systems offer very high potential performance but are difficult to program. OmpSs is a well-known framework for task-based parallel applications and an interesting tool for simplifying the programming of these systems. However, it does not support the co-execution of a single OpenCL kernel instance on several compute devices. To […]

OpenCL

Dec, 19

OpenCL-accelerated Point Feature Histogram and Its Application in Railway Track Point Cloud Data Processing

To meet the requirements of railway track point cloud processing, an OpenCL-accelerated Point Feature Histogram method is proposed that uses heterogeneous computing to improve computational efficiency. The data structure for point cloud storage is reconfigured to suit the parallel computing characteristics of OpenCL. With kernel performance analysis by CodeXL, the data reading […]

OpenCL

Dec, 10

Acceleration of Cellular Automata through Parallel Computing with OpenCL

Cellular Automata (CA) have their origins in the work of Von Neumann and have since become an important research topic with a wide range of applications, from DNA sequencing to ecological dynamics. One aspect that may be of interest during a CA simulation is the evolution in the number of individuals of each […]

OpenCL

Dec, 10

FPGA-Accelerated Image Processing Using High Level Synthesis with OpenCL

High Level Synthesis (HLS) is a new method for developing applications for use on FPGAs. Instead of the classic approach using a Hardware Description Language (HDL), a high-level programming language can be used. HLS has many perks, including high-level debugging and simulation of the system being developed. This shortens the development time which […]

OpenCL

Nov, 30

Qualcomm Snapdragon Mobile Platform OpenCL General Programming and Optimization

This document provides detailed guidance on how to optimize OpenCL programs for Adreno GPUs. It includes a good amount of information to help developers understand the OpenCL fundamentals and Adreno architectures and, most importantly, master OpenCL optimization techniques. OpenCL optimization is often challenging and requires a lot of trial and error. […]

OpenCL

Nov, 30

Intel FPGA SDK for OpenCL

The Intel FPGA SDK for OpenCL Programming Guide provides descriptions, recommendations and usage information on the Intel Software Development Kit (SDK) for OpenCL compiler and tools. The Intel FPGA SDK for OpenCL is an OpenCL-based heterogeneous parallel programming environment for Intel FPGA products.

OpenCL

Nov, 26

High Performance Streaming Smith-Waterman Implementation with Implicit Synchronization on Intel FPGA using OpenCL

The Smith-Waterman algorithm is widely used in bioinformatics and is often used as a benchmark of FPGA performance. Here we present our highly optimized Smith-Waterman implementation on Intel FPGAs using OpenCL. Our implementation is both faster and more efficient than other current Smith-Waterman implementations, obtaining a theoretical performance of 214 GCUPS. Moreover, due to the […]

OpenCL

Nov, 21

Compiling and Optimizing OpenMP 4.X Programs to OpenCL and SPIR

Given their massively parallel computing capabilities, heterogeneous architectures comprised of CPUs and accelerators have been increasingly used to speed up scientific and engineering applications. Nevertheless, programming such architectures is a challenging task for most non-expert programmers, as typical accelerator programming languages (e.g. CUDA and OpenCL) demand a thorough understanding of the underlying hardware to enable an […]

OpenCL

Nov, 16

Launch-time Optimization of OpenCL Kernels

OpenCL kernels are compiled before kernel arguments and launch geometry are provided at launch time. Although some of these values remain constant during execution, the compiler is unable to optimize for them since it has no access to them. We propose and implement a novel approach that identifies such arguments, geometry, and optimizations […]

OpenCL

Oct, 31

PCIeHLS: an OpenCL HLS framework

One of the goals of high level synthesis (HLS) is to make designing hardware accelerators running on FPGAs accessible to developers with a software background (usually implying developers with little grounding in hardware design). While high level synthesis generates accelerator kernels, it generally does not assist with integrating the generated kernels into a system. In […]

OpenCL
