high performance computing on graphics processing units: hgpu.org

Automatic Translation of CUDA to OpenCL and Comparison of Performance Optimizations on GPUs

Deepthi Nandakumar

University of Illinois at Urbana-Champaign

University of Illinois at Urbana-Champaign, 2011

@article{hwu2011automatic,
  title={Automatic Translation of CUDA to OpenCL and Comparison of Performance Optimizations on GPUs},
  author={Hwu, W.M.W.},
  year={2011}
}

As an open, royalty-free framework for writing programs that execute across heterogeneous platforms, OpenCL gives programmers access to a variety of data-parallel processors, including CPUs, GPUs, the Cell, and DSPs. All OpenCL-compliant implementations support a core specification, which ensures functional portability of any OpenCL program. This thesis presents CUDAtoOpenCL, a source-to-source tool that translates code from CUDA to OpenCL so that applications can run on a variety of devices. However, current compiler optimizations are not sufficient to carry performance from a single expression of the program over to a wide variety of architectures. To achieve true performance portability, an open standard like OpenCL needs to be augmented with automatic high-level optimization and transformation tools that can generate optimized code and configurations for any target device. This thesis details the operation and implementation of the CUDAtoOpenCL translator, which is based on the Cetus compiler framework. It also describes key insights from our studies optimizing selected benchmarks for two distinct GPU architectures: the NVIDIA GTX280 and the ATI Radeon HD 5870. The results show that the type and degree of optimization applied to each benchmark need to be adapted to the target architecture. In particular, the two devices differ in the architecture of the basic compute unit, register file organization, on-chip memory limits, DRAM coalescing patterns, and floating-point unit throughput, and each of these interacts differently with each optimization.
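To make the idea of source-to-source translation concrete, here is a minimal sketch of the kind of rewriting involved. It is not the CUDAtoOpenCL tool and does not use Cetus: the function `translate_kernel`, the lookup tables, and the sample kernel `scale` are all hypothetical names introduced for illustration. The sketch relies only on well-known correspondences between CUDA and OpenCL C (`__global__` to `__kernel`, `threadIdx.x` to `get_local_id(0)`, and so on) and assumes every pointer parameter of a kernel refers to global memory.

```python
import re

# Well-known CUDA -> OpenCL C correspondences for work-item indexing.
# These tables are illustrative assumptions, not the thesis's rule set.
BUILTINS = {
    "threadIdx": "get_local_id",
    "blockIdx": "get_group_id",
    "blockDim": "get_local_size",
    "gridDim": "get_num_groups",
}
AXIS = {"x": 0, "y": 1, "z": 2}


def translate_kernel(cuda_src: str) -> str:
    """Rewrite a simple CUDA kernel into OpenCL C using textual rules only."""
    # Kernel and on-chip memory qualifiers, plus the block-level barrier.
    out = re.sub(r"\b__global__\b", "__kernel", cuda_src)
    out = re.sub(r"\b__shared__\b", "__local", out)
    out = re.sub(r"\b__syncthreads\s*\(\s*\)", "barrier(CLK_LOCAL_MEM_FENCE)", out)

    # threadIdx.x -> get_local_id(0), blockIdx.y -> get_group_id(1), ...
    def index_call(m: re.Match) -> str:
        return f"{BUILTINS[m.group(1)]}({AXIS[m.group(2)]})"

    out = re.sub(r"\b(threadIdx|blockIdx|blockDim|gridDim)\.([xyz])\b", index_call, out)

    # OpenCL requires an address-space qualifier on kernel pointer arguments.
    # Assumption: every pointer parameter points to global memory.
    def qualify_pointers(m: re.Match) -> str:
        params = [p.strip() for p in m.group(2).split(",")]
        params = [f"__global {p}" if "*" in p else p for p in params]
        return f"{m.group(1)}({', '.join(params)})"

    return re.sub(r"(__kernel\s+void\s+\w+)\s*\(([^)]*)\)", qualify_pointers, out)


# Hypothetical sample kernel used only to exercise the sketch.
CUDA_SRC = """
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}
"""

if __name__ == "__main__":
    print(translate_kernel(CUDA_SRC))
```

A real translator works on a parsed program representation rather than on text, and it must also rewrite the host-side code (memory allocation, kernel launches, device setup), which this sketch ignores. The translated kernel is functionally portable, but as the abstract notes, its performance still depends on tuning for the target device.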

Tags: ATI, ATI Radeon HD 5870, Benchmarking, Code generation, Computer science, CUDA, Heterogeneous systems, nVidia, nVidia GeForce GTX 280, OpenCL, Optimization, Thesis

September 24, 2011 by hgpu
