high performance computing on graphics processing units: hgpu.org

Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design

Peter Thoman, Klaus Kofler, Heiko Studt, John Thomson, Thomas Fahringer

University of Innsbruck

Euro-Par 2011 Parallel Processing, Lecture Notes in Computer Science, Volume 6853, 2011, pp. 438-452

DOI:10.1007/978-3-642-23397-5_43

@article{thoman2011automatic,
  title={Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design},
  author={Thoman, P. and Kofler, K. and Studt, H. and Thomson, J. and Fahringer, T.},
  journal={Euro-Par 2011 Parallel Processing},
  pages={438--452},
  year={2011},
  publisher={Springer}
}

The OpenCL standard allows targeting a large variety of CPU, GPU and accelerator architectures using a single unified programming interface and language. While the standard guarantees portability of functionality for complying applications and platforms, performance portability on such a diverse set of hardware is limited. Devices may vary significantly in memory architecture as well as type, number and complexity of computational units. To characterize and compare the OpenCL performance of existing and future devices we propose a suite of microbenchmarks, uCLbench. We present measurements for eight hardware architectures – four GPUs, three CPUs and one accelerator – and illustrate how the results accurately reflect unique characteristics of the respective platform. In addition to measuring quantities traditionally benchmarked on CPUs like arithmetic throughput or the bandwidth and latency of various address spaces, the suite also includes code designed to determine parameters unique to OpenCL like the dynamic branching penalties prevalent on GPUs. We demonstrate how our results can be used to guide algorithm design and optimization for any given platform on an example kernel that represents the key computation of a linear multigrid solver. Guided manual optimization of this kernel results in an average improvement of 61% across the eight platforms tested.
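The abstract does not say how uCLbench measures memory latency. One widely used technique for this kind of microbenchmark is pointer chasing, where every load depends on the result of the previous one, so the hardware cannot overlap or prefetch the accesses. The sketch below illustrates that idea on the host in plain Python. It is not code from the paper, and all names (`make_chase_ring`, `chase`, `time_per_access_ns`) are hypothetical. Interpreter overhead dominates the timings here, so the numbers only show the shape of the measurement, not real device latencies.

```python
import random
import time


def make_chase_ring(n, seed=0):
    """Build a permutation of 0..n-1 that forms one single cycle
    (Sattolo's algorithm), so a chase visits every slot before repeating."""
    rng = random.Random(seed)
    ring = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees a single cycle
        ring[i], ring[j] = ring[j], ring[i]
    return ring


def chase(ring, steps, start=0):
    """Perform `steps` dependent loads: each index comes from the previous load."""
    idx = start
    for _ in range(steps):
        idx = ring[idx]
    return idx


def time_per_access_ns(n, steps=200_000):
    """Average wall-clock time per dependent access for a ring of n elements."""
    ring = make_chase_ring(n)
    t0 = time.perf_counter()
    chase(ring, steps)
    return (time.perf_counter() - t0) / steps * 1e9


if __name__ == "__main__":
    # Growing the working set past each cache level is what exposes
    # the latency of the next level in a real microbenchmark.
    for n in (1 << 10, 1 << 14, 1 << 18):
        print(f"{n:>8} elements: {time_per_access_ns(n):.1f} ns per dependent access")
```

In an OpenCL setting, the same access pattern would typically be run inside a kernel against each address space (global, local, constant) in turn, with the ring size swept to separate cache effects from raw memory latency.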

Tags: Algorithms, ATI, ATI Radeon HD 5870, Benchmarking, Cell processor, Computer science, nVidia, nVidia GeForce GTX 275, nVidia GeForce GTX 460, OpenCL, Optimization, Tesla C2050

September 8, 2011 by hgpu
