Posts
Aug, 18
SkePU: a multi-backend skeleton programming library for multi-GPU systems
We present SkePU, a C++ template library which provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP backend. It also supports multi-GPU […]
Aug, 18
Energy-aware metrics for benchmarking heterogeneous systems
With the advent of heterogeneous computing systems consisting of multi-core CPUs and many-core GPUs, robust methods are needed to facilitate fair benchmark comparisons between different systems. In this paper we present a benchmarking methodology for measuring a number of performance metrics for heterogeneous systems. Methods for comparing performance and energy efficiency are included. Consideration is […]
Aug, 18
ATI Stream Profiler: a tool to optimize an OpenCL kernel on ATI Radeon GPUs
Modern GPUs have been shown to be highly efficient machines for data-parallel applications such as graphics, image, video processing, or physical simulation applications. For example, a single ATI Radeon HD 5870 GPU has a theoretical peak of 2.72 teraflops (10^12 floating-point operations per second) with a video memory bandwidth of 153.6 GB/s. While it is […]
Aug, 18
Physical and graphical effects in OpenCL by example
There are strong indications that the future of interactive graphics involves a more flexible programming model than today’s OpenGL/Direct3D pipelines. That means that graphics developers will need a basic understanding of how to combine emerging parallel-programming techniques with the traditional interactive rendering pipeline. This course provides an introduction to parallel-programming architectures and environments for interactive […]
Aug, 18
Parallelization of the x264 encoder using OpenCL
With the introduction of H.264, the complexity of video encoders has increased dramatically. While hardware-based encoding solutions profit from the strictly sequential design and already offer real-time capabilities for high-definition material, software solutions lack most of that encoding performance. More precisely, the performance of software encoders is limited due to the computation […]
Aug, 18
Simulating Biological-Inspired Spiking Neural Networks with OpenCL
The algorithms used for simulating biologically-inspired spiking neural networks (BIANN) often utilize functions which are computationally complex and have to model a large number of neurons – or even a much larger number of synapses in parallel. To use all available computing resources provided by a standard desktop PC is an opportunity to shorten the […]
Aug, 18
Parallel Batch Training of the Self-Organizing Map Using OpenCL
The Self-Organizing Maps (SOMs) are popular artificial neural networks that are often used for data analyses through clustering and visualisation. SOM’s mathematical model is inherently parallel. However, many implementations have not successfully exploited its parallelism because previous attempts often required cluster-like infrastructures. This article presents the parallel implementation of SOMs, particularly the batch map variant […]
Aug, 18
Maestro: Data Orchestration and Tuning for OpenCL Devices
As heterogeneous computing platforms become more prevalent, the programmer must account for complex memory hierarchies in addition to the difficulties of parallel programming. OpenCL is an open standard for parallel computing that helps alleviate this difficulty by providing a portable set of abstractions for device memory hierarchies. However, OpenCL requires that the programmer explicitly controls […]
Aug, 18
Optimizing the exploitation of multicore processors and GPUs with OpenMP and OpenCL
In this paper, we present OMPSs, a programming model based on OpenMP and StarSs, that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on three different architectures, SMP, Cell/B.E. and GPUs, showing the wide usefulness of the approach. The evaluation is done with four different benchmarks, Matrix Multiply, BlackScholes, […]
Aug, 18
Analyzing program flow within a many-kernel OpenCL application
Many developers have begun to realize that heterogeneous multi-core and many-core computer systems can provide significant performance opportunities to a range of applications. Typical applications possess multiple components that can be parallelized; developers need to be equipped with proper performance tools to analyze program flow and identify application bottlenecks. In this paper, we analyze and […]
Aug, 17
Near real-time Fast Bilateral Stereo on the GPU
State-of-the-art local stereo correspondence algorithms that adapt their supports to image content can infer very accurate disparity maps, often comparable to those of algorithms based on global disparity optimization methods. However, despite their effectiveness, accurate local approaches based on this methodology are also computationally expensive and several simplifications aimed at reducing their computational […]
Aug, 17
Fast boosting trees for classification, pose detection, and boundary detection on a GPU
Discriminative classifiers are often the computational bottleneck in medical imaging applications such as foreground/background classification, 3D pose detection, and boundary delineation. To overcome this bottleneck, we propose a fast technique based on boosting tree classifiers adapted for GPU computation. Unlike standard tree-based algorithms, our method makes no recursive calls, which makes it GPU-friendly. […]