high performance computing on graphics processing units: hgpu.org


Performance Portability in Accelerated Parallel Kernels

John A. Stratton, Hee-Seok Kim, Thomas B. Jablin, Wen-Mei W. Hwu

University of Illinois at Urbana-Champaign, Center for Reliable and High-Performance Computing

University of Illinois at Urbana-Champaign, IMPACT Technical Report IMPACT-13-01, 2013

@techreport{stratton:12:parboil2.5,
  author={Stratton, John A. and Kim, Hee-Seok and Jablin, Thomas B. and Hwu, Wen-Mei W.},
  title={Performance Portability in Accelerated Parallel Kernels},
  institution={University of Illinois at Urbana-Champaign},
  number={IMPACT-13-01},
  address={Urbana},
  year={2013},
  month={may}
}


Heterogeneous architectures, by definition, include multiple processing components with very different microarchitectures and execution models. In particular, computing platforms from supercomputers to smartphones can now incorporate both CPU and GPU processors. Disparities between CPU and GPU processor architectures have naturally led to distinct programming models and development patterns for each component. Developers for a specific system decompose their application, assign different parts to different heterogeneous components, and express each part in its assigned component’s native model. But without additional effort, that application will not be suitable for another architecture with a different heterogeneous component balance. Developers addressing a variety of platforms must either write multiple implementations for every potential heterogeneous component or fall back to a "safe" CPU implementation, incurring a high development cost or loss of system performance, respectively. The disadvantages of developing for heterogeneous systems are vastly reduced if one source code implementation can be mapped to either a CPU or GPU architecture with high performance.
A convention has emerged from the OpenCL community defining how to write kernels for performance portability among different GPU architectures. This paper demonstrates that OpenCL programs written according to this convention contain enough abstract performance information to enable effective translations to CPU architectures as well. The challenge is that an OpenCL implementation must focus on those programming conventions more than the most natural mapping of the language specification to the target architecture. In particular, prior work implementing OpenCL on CPU platforms neglects the OpenCL kernel’s implicit expression of performance properties such as spatial or temporal locality. We outline some concrete transformations that can be applied to an OpenCL kernel to suitably map the abstract performance properties to CPU execution constructs. We show that such transformations result in marked performance improvements over existing CPU OpenCL implementations for GPU-portable OpenCL kernels. Ultimately, we show that the performance of GPU-portable OpenCL kernels, when using our methodology, is comparable to the performance of native multicore CPU programming models such as OpenMP.
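The abstract does not spell out the transformations themselves, so the following is only a rough illustration of the general idea of mapping a GPU-style kernel onto CPU loops, not the authors' method. It is a toy Python model that assumes a one-dimensional index space; the names `saxpy_kernel` and `run_ndrange_on_cpu` are hypothetical and belong to no OpenCL implementation. The per-work-item body is written in the GPU-portable style, where adjacent work-items touch adjacent elements, and the executor serializes the work-items of each work-group into an inner loop so that this spatial locality becomes a contiguous memory walk on the CPU.

```python
# Sketch only: a toy model of running a GPU-style kernel on a CPU.
# saxpy_kernel and run_ndrange_on_cpu are hypothetical names, not OpenCL APIs.

def saxpy_kernel(global_id, a, x, y, out):
    """Per-work-item body: adjacent work-items touch adjacent elements."""
    if global_id < len(x):  # guard for a partially filled last work-group
        out[global_id] = a * x[global_id] + y[global_id]


def run_ndrange_on_cpu(kernel, global_size, local_size, *args):
    """Execute a 1-D index space by serializing work-items into loops.

    The outer loop walks work-groups (the unit a runtime could hand to one
    CPU thread). The inner loop walks the work-items of a single group, so
    consecutive iterations access consecutive memory locations.
    """
    # Round the index space up to a whole number of work-groups.
    num_groups = (global_size + local_size - 1) // local_size
    for group_id in range(num_groups):
        for local_id in range(local_size):
            kernel(group_id * local_size + local_id, *args)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(x)
run_ndrange_on_cpu(saxpy_kernel, len(x), 4, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0, 60.0]
```

In this model the inner loop plays the role that a vectorizable or cache-friendly loop would play in a compiled CPU implementation, while the outer loop marks where thread-level parallelism could be introduced.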

Tags: Computer science, Heterogeneous systems, OpenCL, OpenMP, Performance

May 27, 2013 by hgpu
