high performance computing on graphics processing units: hgpu.org


From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming

Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, Jack Dongarra

University of Tennessee, Knoxville

University of Tennessee, Tech. report UT-CS-10-656, 2011

@techreport{dua2011cuda,
  title={From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming},
  author={Du, P. and Weber, R. and Luszczek, P. and Tomov, S. and Peterson, G. and Dongarra, J.},
  institution={Technical Report CS-10-656, Electrical Engineering and Computer Science Department, University of Tennessee, 2010. LAPACK Working Note 228},
  year={2011}
}


In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos Group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL requires performance-impacting initializations that have no counterpart in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose the triangular solver (TRSM) and matrix multiplication (GEMM) as representative Level 3 BLAS routines to implement in OpenCL. We profile TRSM to obtain the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the NVIDIA Tesla C2050 and the ATI Radeon 5870, the latest GPUs offered by the two companies. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies between the optimizations performed by the OpenCL and CUDA compilers, and other issues that affect performance. Experimental results show that nearly 50% of peak performance can be obtained for GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning to better explore these kernels' parameter space using a search harness.
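The auto-tuning idea in the last sentence can be illustrated with a minimal sketch. It is not the authors' search harness, which is not described on this page. It assumes a pure-Python tiled GEMM as a CPU stand-in for an OpenCL kernel and a single tunable parameter, the tile size. The names `gemm_tiled` and `autotune` are hypothetical. The harness times each candidate, converts the best time into a flop rate using the standard 2n^3 operation count for GEMM, and returns the fastest configuration.

```python
import random
import time


def gemm_tiled(a, b, n, tile):
    """Return C = A * B for n x n row-major lists, computed tile by tile.

    Hypothetical stand-in for a tunable GPU kernel; `tile` is the parameter
    a real auto-tuner would vary alongside others.
    """
    c = [0.0] * (n * n)
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        aik = a[i * n + k]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i * n + j] += aik * b[k * n + j]
    return c


def autotune(kernel, n, candidates, repeats=3, seed=0):
    """Exhaustively time `kernel` for each candidate tile size.

    Returns (best_tile, rates), where rates maps tile size to flop/s
    based on the fastest of `repeats` runs.
    """
    rng = random.Random(seed)
    a = [rng.random() for _ in range(n * n)]
    b = [rng.random() for _ in range(n * n)]
    rates = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(a, b, n, tile)
            best = min(best, time.perf_counter() - start)
        rates[tile] = 2.0 * n ** 3 / best  # GEMM performs 2*n^3 flops
    best_tile = max(rates, key=rates.get)
    return best_tile, rates


if __name__ == "__main__":
    tile, rates = autotune(gemm_tiled, n=64, candidates=[4, 8, 16, 32])
    for t, r in sorted(rates.items()):
        print(f"tile={t:3d}  {r / 1e6:8.2f} Mflop/s")
    print("fastest tile size:", tile)
```

The winning tile size depends on the machine the harness runs on, which is the point the abstract makes about performance not being portable: the same source needs a per-platform search. A real harness would search several parameters at once and would typically prune the space rather than enumerate it.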

Tags: ATI, ATI Radeon HD 5870, Computer science, CUDA, Linear Algebra, Matrix multiplication, nVidia, OpenCL, Optimization, Performance, Tesla C2050

September 26, 2011 by hgpu
