high performance computing on graphics processing units: hgpu.org

Posts

Sep, 19

Automatic OpenCL code generation for multi-device heterogeneous architectures

Using multiple accelerators, such as GPUs or Xeon Phis, is attractive for improving the performance of large data-parallel applications and for increasing the size of their workloads. However, writing an application for multiple accelerators remains challenging today, because moving from a single accelerator to several requires dealing with potentially nonuniform domain […]

OpenCL

Sep, 19

Automatic Online Tuning (AutoTune): Fully Extended Analysis

The AutoTune project develops the Periscope Tuning Framework (PTF), which includes several plugins that target performance improvements as well as reduced energy consumption of applications. One of the main advantages of PTF over other tuning frameworks is its capability to combine tuning and analysis strategies to simplify and speed up the tuning process. To support the […]

OpenCL

Sep, 19

Parallel Decompression of Seismic Data on GPU Using a Lifting Wavelet Algorithm

Subsurface images are widely used by oil companies to find oil reservoirs. Constructing these images involves collecting and processing a huge amount of seismic data. Generally, oil companies use compression algorithms to reduce storage and transmission costs. Currently, the compression process is developed on-site using CPU architectures, whereas the […]

CUDA

Sep, 19

Autotuning Wavefront Patterns for Heterogeneous Architectures

Manual tuning of applications for heterogeneous parallel systems is tedious and complex. Optimizations are often not portable, and the whole process must be repeated when moving to a new system, or sometimes even to a different problem size. Pattern based parallel programming models were originally designed to provide programmers with an abstract layer, hiding tedious […]

OpenCL

Sep, 19

An OpenCL design of the Bob Jenkins lookup3 hash function using the Xilinx SDAccel Development Environment

In this report, we present an OpenCL-based design of a hashing function which forms a core component of memcached [1], a distributed in-memory key-value store caching layer widely used to reduce access load between web servers and databases. Our work has been inspired by recent research investigations on dataflow architectures for key-value stores that can […]

OpenCL

Sep, 17

Efficient Kernel Fusion Techniques for Massive Video Data Analysis on GPGPUs

Kernels are executable code segments, and kernel fusion is a technique for combining the segments in a coherent manner to improve execution time. For the first time, we have developed a technique to fuse image processing kernels to be executed on GPGPUs for improving execution time and total throughput (amount of data processed in unit […]

CUDA
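
The fusion idea named in the abstract above can be illustrated with a minimal CPU-side sketch. It is not the paper's technique or code: the two "kernels" (`brighten`, `threshold`), the fused variant, and the sample pixel values are all hypothetical, and plain Python lists stand in for GPU buffers. The point is only that the unfused version makes two full passes and materializes an intermediate buffer, while the fused version produces the same result in a single pass.

```python
def brighten(pixels, offset):
    """Hypothetical kernel 1: add a constant to every pixel (one full pass)."""
    return [p + offset for p in pixels]


def threshold(pixels, cutoff):
    """Hypothetical kernel 2: binarize the pixels (a second full pass)."""
    return [1 if p >= cutoff else 0 for p in pixels]


def brighten_then_threshold_fused(pixels, offset, cutoff):
    """Fused kernel: both steps applied per element in a single pass,
    so no intermediate buffer is written and then read back."""
    return [1 if p + offset >= cutoff else 0 for p in pixels]


# Made-up sample data, for illustration only.
image = [10, 120, 200, 90, 130]

# Unfused: two passes, one intermediate list.
unfused = threshold(brighten(image, 20), 128)

# Fused: one pass, same result.
fused = brighten_then_threshold_fused(image, 20, 128)
```

On a GPU the saving would come mainly from avoiding the round trip of the intermediate buffer through global memory and from launching one kernel instead of two; how much that helps depends on the kernels involved.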

Sep, 17

SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration

Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as the sequential code or data transfer management. This work distribution can be a poor solution as it […]

OpenCL

Sep, 17

CLTune: A Generic Auto-Tuner for OpenCL Kernels

This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel performance over a generic, user-defined search space of possible parameter-value combinations. Example parameters include the OpenCL workgroup size, vector data-types, tile sizes, and loop unrolling factors. CLTune can be used in the following scenarios: 1) when there are too many tunable […]

OpenCL
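
A search space of parameter-value combinations, as described in the abstract above, can be sketched in a few lines. This is not CLTune's API: the parameter names and values, the `fits_local_memory` constraint, and the `toy_cost` function (a made-up stand-in for timing a real kernel launch) are all assumptions for illustration, and plain exhaustive search is used here as the simplest possible strategy.

```python
import itertools


def search_space(parameters):
    """Yield every parameter-value combination as a dict."""
    names = list(parameters)
    for values in itertools.product(*(parameters[n] for n in names)):
        yield dict(zip(names, values))


def tune(parameters, cost, is_valid=lambda config: True):
    """Exhaustively evaluate all valid configurations; return the cheapest."""
    best_config, best_cost = None, float("inf")
    for config in search_space(parameters):
        if not is_valid(config):
            continue  # skip configurations that violate a constraint
        current = cost(config)
        if current < best_cost:
            best_config, best_cost = config, current
    return best_config, best_cost


# Hypothetical tunable parameters (names and values are assumptions).
PARAMETERS = {
    "workgroup_size": [32, 64, 128, 256],
    "vector_width": [1, 2, 4],
    "tile_size": [8, 16, 32],
    "unroll": [1, 2, 4],
}


def fits_local_memory(config):
    """Hypothetical constraint: the tile must fit in a 2048-element buffer."""
    return config["tile_size"] ** 2 * config["vector_width"] <= 2048


def toy_cost(config):
    """Made-up cost model standing in for a measured kernel run time."""
    return (abs(config["workgroup_size"] - 128) / 128
            + abs(config["vector_width"] - 4) / 4
            + abs(config["tile_size"] - 16) / 16
            + 1.0 / config["unroll"])


best, best_cost = tune(PARAMETERS, toy_cost, fits_local_memory)
```

In a real tuner, `toy_cost` would compile and time the kernel for each configuration. Exhaustive search grows multiplicatively with every added parameter, which is why larger spaces typically call for sampling or heuristic search instead.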

Sep, 17

gSLICr: SLIC superpixels at over 250Hz

We introduce a parallel GPU implementation of the Simple Linear Iterative Clustering (SLIC) superpixel segmentation. Using a single graphics card, our implementation achieves speedups of up to 83x over the standard sequential implementation. Our implementation is fully compatible with the standard sequential implementation, and the software is now available online as open source.

CUDA

Sep, 17

Scalable Metropolis Monte Carlo for simulation of hard shapes

We design and implement HPMC, a scalable hard particle Monte Carlo simulation toolkit, and release it open source as part of HOOMD-blue. HPMC runs in parallel on many CPUs and many GPUs using domain decomposition. We employ BVH trees instead of cell lists on the CPU for fast performance, especially with large particle size disparity, […]

CUDA

Sep, 15

Efficient Convolutional Neural Networks for Pixelwise Classification on Heterogeneous Hardware Systems

This work presents and analyzes three convolutional neural network (CNN) models for efficient pixelwise classification of images. When convolutional neural networks are used to classify single pixels in patches of a whole image, sliding window networks carry out a lot of redundant computations. This set of new architectures solves this issue by either […]

CUDA • OpenCL

Sep, 15

linalg: Matrix Computations in Apache Spark

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark comes with the mllib.linalg library, which provides abstractions and implementations for distributed matrices. Using these abstractions, we highlight the computations that were more challenging to distribute. When translating single-node algorithms to run on a distributed cluster, we observe […]

CUDA

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Most viewed papers (last 30 days)