Analyzing Use of OpenCL on the Cell Broadband Engine and a Proposal for OpenCL Extensions

Jens Breitbart, Claudia Fohry

Research Group Programming Languages / Methodologies, University of Kassel, Wilhelmshöher Allee 73, 34121 Kassel, Germany

International Journal of Networking and Computing, North America, 1, 2011

@article{breitbart2011analyzing,
  title={Analyzing Use of OpenCL on the Cell Broadband Engine and a Proposal for OpenCL Extensions},
  author={Breitbart, J. and Fohry, C.},
  journal={International Journal of Networking and Computing},
  volume={1},
  number={1},
  pages={pp–114},
  year={2011}
}

Current processor architectures are diverse and heterogeneous. Examples include multicore chips, GPUs and the Cell Broadband Engine (CBE). The recent Open Computing Language (OpenCL) standard aims at both efficiency and portability. This paper explores its efficiency when implemented on the CBE without using CBE-specific features such as explicit asynchronous memory transfers. We based our experiments on two applications: matrix multiplication and the client side of the Einstein@Home distributed computing project. Both were programmed in OpenCL and then translated to the CBE. For matrix multiplication, we applied different levels of OpenCL performance optimization and observed that they pay off on the CBE. For Einstein@Home, our translated OpenCL version achieves almost the same speed as a native CBE version. We experimented with two versions of the OpenCL-to-CBE mapping: one in which the PPE component of the CBE takes the role of a compute unit, and one in which it does not. Another major contribution of the paper is a proposal for two OpenCL extensions, which we analyzed for both the CBE and NVIDIA GPUs. First, we suggest an additional memory level in OpenCL, called static local memory. With little programming effort, it can lead to significant speedups: for reduction, a factor of seven on the CBE and about 20% on NVIDIA GPUs. Second, we introduce static work-groups to support user-defined mappings of tasks. Static work-groups may simplify programming and lead to speedups of 35% (CBE) and 100% (GPU) for all-parallel-prefix-sums.
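For readers unfamiliar with the reduction benchmark mentioned above, the following is a minimal sketch of the common two-stage OpenCL-style reduction, emulated sequentially in plain Python. It is not the paper's code and does not model the proposed static local memory extension. The names `work_group_reduce`, `reduce_sum` and `group_size` are hypothetical, and the `local` list merely stands in for a work-group's local memory.

```python
def work_group_reduce(chunk):
    """Tree-reduce one work-group's chunk; `local` stands in for local memory."""
    local = list(chunk)
    # Pad to a power of two so every tree level pairs elements cleanly.
    size = 1
    while size < len(local):
        size *= 2
    local.extend([0] * (size - len(local)))

    stride = size // 2
    while stride > 0:
        # Each loop iteration plays the role of one work-item; on a real
        # device a barrier would follow this loop before the next level.
        for lid in range(stride):
            local[lid] += local[lid + stride]
        stride //= 2
    return local[0]


def reduce_sum(data, group_size=4):
    """Stage 1: one partial sum per work-group. Stage 2: combine on the host."""
    partial = [
        work_group_reduce(data[start:start + group_size])
        for start in range(0, len(data), group_size)
    ]
    return sum(partial)
```

In this scheme the partial sums leave each work-group only once, which is why the speed and lifetime of the scratch memory used inside the group can matter for overall performance.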

Tags: Cell processor, Computer science, Distributed computing, Heterogeneous systems, Matrix multiplication, nVidia, nVidia GeForce GTX 280, OpenCL, Optimization, Playstation

September 24, 2011 by hgpu
