
Posts

Sep, 19

Automatic OpenCL code generation for multi-device heterogeneous architectures

Using multiple accelerators, such as GPUs or Xeon Phis, is attractive for improving the performance of large data-parallel applications and for increasing the size of their workloads. However, writing an application for multiple accelerators remains challenging today, because going from a single accelerator to multiple ones requires dealing with potentially non-uniform domain […]
Sep, 19

Automatic Online Tuning (AutoTune): Fully Extended Analysis

The AutoTune project develops the Periscope Tuning Framework (PTF), which includes several plugins targeting performance improvements as well as reduced energy consumption of applications. One of the main advantages of PTF over other tuning frameworks is its capability to combine tuning and analysis strategies to simplify and speed up the tuning process. To support the […]
Sep, 19

Parallel Decompression of Seismic Data on GPU Using a Lifting Wavelet Algorithm

Subsurface images are widely used by oil companies to find oil reservoirs. Constructing these images involves collecting and processing a huge amount of seismic data. Oil companies generally use compression algorithms to reduce storage and transmission costs. Currently, the compression process is carried out on-site using CPU architectures, whereas the […]
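
The post's title refers to a lifting wavelet algorithm. As a minimal, hedged illustration of the lifting idea (split, predict, update), and not the authors' GPU decompression kernel or their specific wavelet filter, here is a single-level Haar-style lifting transform and its inverse in plain Python:

```python
def haar_lift_forward(signal):
    """One level of a Haar-style lifting transform (illustrative only).

    Split the signal into even and odd samples, predict the odd samples
    from the even ones (detail coefficients), then update the even
    samples to carry the local average (approximation coefficients).
    """
    assert len(signal) % 2 == 0, "example assumes an even-length signal"
    even = signal[0::2]
    odd = signal[1::2]
    detail = [o - e for e, o in zip(even, odd)]            # predict step
    approx = [e + d / 2.0 for e, d in zip(even, detail)]   # update step
    return approx, detail


def haar_lift_inverse(approx, detail):
    """Invert the lifting steps in reverse order."""
    even = [a - d / 2.0 for a, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    signal = []
    for e, o in zip(even, odd):
        signal.extend([e, o])
    return signal


if __name__ == "__main__":
    x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
    a, d = haar_lift_forward(x)
    assert haar_lift_inverse(a, d) == x
```
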
Sep, 19

Autotuning Wavefront Patterns for Heterogeneous Architectures

Manual tuning of applications for heterogeneous parallel systems is tedious and complex. Optimizations are often not portable, and the whole process must be repeated when moving to a new system, or sometimes even to a different problem size. Pattern-based parallel programming models were originally designed to provide programmers with an abstraction layer, hiding tedious […]
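
For context on the pattern being tuned: a wavefront computation sweeps a 2-D grid along anti-diagonals, because each cell depends on its north and west neighbours, and all cells on one anti-diagonal are independent and can run in parallel. The sketch below is a plain sequential Python illustration of that traversal; the tunable parameters studied in the paper (such as tile sizes) are not modelled here.

```python
def wavefront(n, m, cost):
    """Fill an n x m table where each cell depends on its north and
    west neighbours, visiting cells anti-diagonal by anti-diagonal.

    Cells on the same anti-diagonal (i + j == d) are independent of
    each other, which is what a parallel or tuned implementation exploits.
    """
    table = [[0] * m for _ in range(n)]
    for d in range(n + m - 1):                 # one anti-diagonal at a time
        for i in range(max(0, d - m + 1), min(n, d + 1)):
            j = d - i
            north = table[i - 1][j] if i > 0 else 0
            west = table[i][j - 1] if j > 0 else 0
            table[i][j] = cost(i, j) + max(north, west)
    return table


if __name__ == "__main__":
    # Toy cost function; a real use case would be e.g. sequence alignment.
    result = wavefront(4, 5, lambda i, j: (i + j) % 3)
    print(result[-1][-1])
```
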
Sep, 19

An OpenCL design of the Bob Jenkins lookup3 hash function using the Xilinx SDAccel Development Environment

In this report, we present an OpenCL-based design of a hashing function that forms a core component of memcached [1], a distributed in-memory key-value caching layer widely used to reduce the access load between web servers and databases. Our work has been inspired by recent research investigations on dataflow architectures for key-value stores that can […]
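
In a memcached-style caching layer, the hash maps a key to a bucket or back-end server; the report's contribution is an OpenCL/FPGA design of the hashing step itself. The sketch below only illustrates that surrounding use, with a stock hash standing in for Jenkins' lookup3 and made-up server addresses:

```python
import hashlib


def server_for_key(key, servers):
    """Map a cache key to one of the back-end servers (illustrative only).

    A memcached-style layer hashes the key and uses the digest to pick a
    bucket; the report's OpenCL design accelerates the hashing step itself.
    Here a stock hash stands in for lookup3.
    """
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[bucket]


if __name__ == "__main__":
    # Hypothetical server addresses, for illustration only.
    servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
    for key in ("user:42", "session:abc", "page:/index"):
        print(key, "->", server_for_key(key, servers))
```
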
Sep, 17

Efficient Kernel Fusion Techniques for Massive Video Data Analysis on GPGPUs

Kernels are executable code segments, and kernel fusion is a technique for combining the segments in a coherent manner to improve execution time. For the first time, we have developed a technique to fuse image-processing kernels to be executed on GPGPUs, improving execution time and total throughput (the amount of data processed in unit […]
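
As a hedged, CPU-side illustration of the kernel-fusion idea (the paper's kernels are GPGPU image-processing kernels): two separate passes each traverse the whole image and the first one materialises an intermediate buffer, whereas the fused version applies both operations in a single pass per pixel. The operations below (brighten, then threshold) are made up for illustration:

```python
def unfused(pixels, brightness, threshold):
    """Two separate 'kernels': each traverses the whole image and the
    first one materialises an intermediate buffer."""
    brightened = [min(255, p + brightness) for p in pixels]    # kernel 1
    return [255 if p >= threshold else 0 for p in brightened]  # kernel 2


def fused(pixels, brightness, threshold):
    """Fused 'kernel': one traversal, no intermediate buffer."""
    return [255 if min(255, p + brightness) >= threshold else 0
            for p in pixels]


if __name__ == "__main__":
    image = [10, 120, 200, 250, 90]
    assert unfused(image, 40, 128) == fused(image, 40, 128)
```
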
Sep, 17

SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration

Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data-transfer management. This work distribution can be a poor solution, as it […]
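
The underlying idea is to split one data-parallel kernel's index range across the CPU and GPU instead of assigning them fixed roles. The sketch below shows a proportional split of an iteration space; the device names and throughput numbers are hypothetical, and SKMD itself performs this partitioning transparently for real kernels:

```python
def partition_range(n_items, device_throughputs):
    """Split [0, n_items) into contiguous chunks proportional to each
    device's measured (or estimated) throughput."""
    total = sum(device_throughputs.values())
    chunks, start = {}, 0
    devices = list(device_throughputs)
    for k, dev in enumerate(devices):
        if k == len(devices) - 1:
            end = n_items            # last device takes the remainder
        else:
            end = start + round(n_items * device_throughputs[dev] / total)
        chunks[dev] = (start, end)
        start = end
    return chunks


if __name__ == "__main__":
    # Hypothetical relative throughputs (items per ms) for one kernel.
    print(partition_range(1_000_000, {"cpu": 120.0, "gpu": 880.0}))
```
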
Sep, 17

CLTune: A Generic Auto-Tuner for OpenCL Kernels

This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel performance over a generic, user-defined search space of possible parameter-value combinations. Example parameters include the OpenCL workgroup size, vector data types, tile sizes, and loop unrolling factors. CLTune can be used in the following scenarios: 1) when there are too many tunable […]
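
As a hedged sketch of what a user-defined search space of parameter-value combinations means in practice (this is not CLTune's actual C++ API): enumerate the Cartesian product of the tunable parameters, time the kernel for each legal combination, and keep the fastest. A stub cost function stands in for an OpenCL kernel launch:

```python
import itertools


def exhaustive_tune(parameters, time_kernel, is_legal=lambda cfg: True):
    """Brute-force search over all parameter-value combinations.

    parameters  : dict mapping parameter name -> list of candidate values
    time_kernel : callable(config) -> runtime in ms (would launch the
                  OpenCL kernel in a real tuner)
    is_legal    : optional constraint filter, e.g. work-group size limits
    """
    best_cfg, best_time = None, float("inf")
    names = list(parameters)
    for values in itertools.product(*(parameters[n] for n in names)):
        cfg = dict(zip(names, values))
        if not is_legal(cfg):
            continue
        t = time_kernel(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time


if __name__ == "__main__":
    space = {"WORKGROUP_X": [8, 16, 32], "TILE": [1, 2, 4], "UNROLL": [1, 4]}
    # Stub cost model in place of a real kernel launch.
    fake_time = lambda c: 100.0 / (c["WORKGROUP_X"] * c["TILE"]) + c["UNROLL"]
    print(exhaustive_tune(space, fake_time))
```
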
Sep, 17

Scalable Metropolis Monte Carlo for simulation of hard shapes

We design and implement HPMC, a scalable hard-particle Monte Carlo simulation toolkit, and release it as open source as part of HOOMD-blue. HPMC runs in parallel on many CPUs and many GPUs using domain decomposition. On the CPU, we employ BVH trees instead of cell lists for fast performance, especially with large particle-size disparity, […]
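
For context on the basic move type: a hard-particle Metropolis step proposes a small random displacement and accepts it only if the particle overlaps nothing, since an overlap has infinite energy and is always rejected. The sketch below is a brute-force 2-D hard-disk version in plain Python, without the cell lists, BVH trees, or domain decomposition that make HPMC scalable:

```python
import random


def overlaps(pos, others, diameter):
    """Brute-force hard-disk overlap test (HPMC uses cell lists / BVH trees)."""
    d2 = diameter * diameter
    return any((pos[0] - q[0]) ** 2 + (pos[1] - q[1]) ** 2 < d2 for q in others)


def sweep(positions, diameter=1.0, max_step=0.1, rng=random):
    """One Metropolis sweep: one trial displacement per disk, accepted only
    if it creates no overlap (hard interactions: accept/reject is 0 or 1)."""
    accepted = 0
    for i, (x, y) in enumerate(positions):
        trial = (x + rng.uniform(-max_step, max_step),
                 y + rng.uniform(-max_step, max_step))
        others = positions[:i] + positions[i + 1:]
        if not overlaps(trial, others, diameter):
            positions[i] = trial
            accepted += 1
    return accepted / len(positions)


if __name__ == "__main__":
    # Dilute start on a coarse grid so the initial state has no overlaps.
    disks = [(3.0 * i, 3.0 * j) for i in range(5) for j in range(5)]
    print("acceptance ratio:", sweep(disks))
```
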
Sep, 17

gSLICr: SLIC superpixels at over 250Hz

We introduce a parallel GPU implementation of Simple Linear Iterative Clustering (SLIC) superpixel segmentation. Using a single graphics card, our implementation achieves speedups of up to 83x over the standard sequential implementation. Our implementation is fully compatible with the standard sequential implementation, and the software is now available online as open source.
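
For reference, the core of SLIC is a k-means-like assignment in which each pixel is compared only against nearby cluster centres using a distance that combines colour and spatial terms, D = sqrt(d_c^2 + (d_s/S)^2 * m^2), with grid interval S and compactness m. Below is a scalar sketch of that distance in Python (the GPU implementation parallelises the assignment over pixels); the numbers are made up:

```python
import math


def slic_distance(pixel, center, grid_interval, compactness):
    """SLIC-style distance between a pixel and a cluster centre.

    pixel / center are (l, a, b, x, y) tuples; the colour distance d_c and
    the spatial distance d_s are combined as
        D = sqrt(d_c^2 + (d_s / S)^2 * m^2)
    where S is the superpixel grid interval and m the compactness weight.
    """
    dc2 = sum((p - c) ** 2 for p, c in zip(pixel[:3], center[:3]))
    ds2 = sum((p - c) ** 2 for p, c in zip(pixel[3:], center[3:]))
    return math.sqrt(dc2 + (ds2 / grid_interval ** 2) * compactness ** 2)


if __name__ == "__main__":
    px = (52.0, 10.0, -4.0, 120.0, 80.0)      # (l, a, b, x, y)
    c1 = (50.0, 12.0, -6.0, 110.0, 75.0)
    c2 = (80.0, -3.0, 2.0, 118.0, 82.0)
    best = min((c1, c2), key=lambda c: slic_distance(px, c, 20, 10))
    print("assigned to:", best)
```
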
Sep, 15

linalg: Matrix Computations in Apache Spark

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark comes with the mllib.linalg library, which provides abstractions and implementations for distributed matrices. Using these abstractions, we highlight the computations that were more challenging to distribute. When translating single-node algorithms to run on a distributed cluster, we observe […]
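
A minimal PySpark sketch of the distributed-matrix abstractions the paper describes, assuming pyspark is installed and that RowMatrix and its computeGramianMatrix method are available in your Spark version:

```python
# Assumes pyspark.mllib.linalg.distributed.RowMatrix and its
# computeGramianMatrix method exist in the installed Spark version.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("linalg-sketch").getOrCreate()

# Each RDD element is one row of a tall-and-skinny distributed matrix A.
rows = spark.sparkContext.parallelize([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
])
A = RowMatrix(rows)

print(A.numRows(), A.numCols())     # 3 x 3 in this toy example
gram = A.computeGramianMatrix()     # A^T A, gathered to the driver
print(gram)

spark.stop()
```
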
Sep, 15

Refinements in Syntactic Parsing

Syntactic parsing is one of the core tasks of natural language processing, with many applications in downstream NLP tasks, from machine translation and summarization to relation extraction and coreference resolution. Parsing performance on English texts, particularly well-edited newswire text, is generally regarded as quite good. However, state-of-the-art constituency parsers produce incorrect parses for more […]


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org