Performance Portability in Accelerated Parallel Kernels

John A. Stratton, Hee-Seok Kim, Thomas B. Jablin, Wen-Mei W. Hwu
University of Illinois at Urbana-Champaign, Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign, IMPACT Technical Report IMPACT-13-01, 2013


@techreport{stratton2013performance,
   author={Stratton, John A. and Kim, Hee-Seok and Jablin, Thomas B. and Hwu, Wen-Mei W.},
   title={Performance Portability in Accelerated Parallel Kernels},
   institution={University of Illinois at Urbana-Champaign},
   number={IMPACT-13-01},
   year={2013}
}
Heterogeneous architectures, by definition, include multiple processing components with very different microarchitectures and execution models. In particular, computing platforms from supercomputers to smartphones can now incorporate both CPU and GPU processors. Disparities between CPU and GPU processor architectures have naturally led to distinct programming models and development patterns for each component. Developers for a specific system decompose their application, assign different parts to different heterogeneous components, and express each part in its assigned component's native model. But without additional effort, that application will not be suitable for another architecture with a different heterogeneous component balance. Developers addressing a variety of platforms must either write multiple implementations for every potential heterogeneous component or fall back to a "safe" CPU implementation, incurring a high development cost or a loss of system performance, respectively. The disadvantages of developing for heterogeneous systems are vastly reduced if one source code implementation can be mapped to either a CPU or GPU architecture with high performance.
A convention has emerged from the OpenCL community defining how to write kernels for performance portability among different GPU architectures. This paper demonstrates that OpenCL programs written according to this convention contain enough abstract performance information to enable effective translations to CPU architectures as well. The challenge is that an OpenCL implementation must focus on those programming conventions more than on the most natural mapping of the language specification to the target architecture. In particular, prior work implementing OpenCL on CPU platforms neglects the OpenCL kernel's implicit expression of performance properties such as spatial or temporal locality. We outline some concrete transformations that can be applied to an OpenCL kernel to suitably map the abstract performance properties to CPU execution constructs. We show that such transformations result in marked performance improvements over existing CPU OpenCL implementations for GPU-portable OpenCL kernels. Ultimately, we show that the performance of GPU-portable OpenCL kernels, when using our methodology, is comparable to the performance of native multicore CPU programming models such as OpenMP.
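The abstract's notion of mapping an OpenCL kernel's abstract performance properties onto CPU execution constructs can be illustrated with a minimal sketch. The function names, the vector-add kernel, and the specific loop structure below are illustrative assumptions, not the paper's actual implementation: the sketch shows the general idea of serializing the work-items of each work-group into an inner loop on the CPU, so that consecutive iterations touch consecutive memory locations and the kernel's implicit spatial locality is preserved rather than destroyed by per-work-item thread scheduling.

```c
#include <assert.h>
#include <stddef.h>

/* A GPU-portable OpenCL kernel might read (illustrative example only):
 *
 *   __kernel void vadd(__global const float *a,
 *                      __global const float *b,
 *                      __global float *c)
 *   {
 *       size_t i = get_global_id(0);
 *       c[i] = a[i] + b[i];
 *   }
 *
 * A CPU mapping sketch: instead of running each work-item on its own
 * thread (as a naive implementation might), the work-items of each
 * work-group are serialized into a loop. Consecutive iterations then
 * access consecutive elements, matching the CPU cache line layout. */
static void vadd_cpu(const float *a, const float *b, float *c,
                     size_t global_size, size_t local_size)
{
    for (size_t group = 0; group < global_size / local_size; ++group) {
        /* one serial loop replaces the work-items of this work-group */
        for (size_t lid = 0; lid < local_size; ++lid) {
            size_t gid = group * local_size + lid;  /* get_global_id(0) */
            c[gid] = a[gid] + b[gid];
        }
    }
}
```

In a real implementation, the outer per-work-group loop would typically be distributed across CPU cores, while the inner per-work-item loop is kept serial (and is a natural candidate for vectorization), since work-items within a group are exactly where GPU-portable kernels concentrate spatial locality.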
