high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » OpenACC cache Directive: Opportunities and Optimizations

OpenACC cache Directive: Opportunities and Optimizations

Ahmad Lashgar, Amirali Baniasadi

Electrical and Computer Engineering Department, University of Victoria

Third Workshop on Accelerator Programming Using Directives (WACCPD), 2016

@inproceedings{lashgar2016openacc,

title={OpenACC cache directive: opportunities and optimizations},

author={Lashgar, Ahmad and Baniasadi, Amirali},

booktitle={Proceedings of the Third International Workshop on Accelerator Programming Using Directives},

pages={46–56},

year={2016},

organization={IEEE Press}

}

Download (PDF)

View

Source

Source codes

Package:

IPMACC: a framework for translating OpenACC for C API to CUDA, OpenCL, and Intel ISPC

1800

views

OpenACC’s programming model presents a simple interface to programmers, offering a trade-off between performance and development effort. OpenACC relies on compiler technologies to generate efficient code and optimize for performance. Among the difficult to implement directives, is the cache directive. The cache directive allows the programmer to utilize accelerator’s hardware- or software-managed caches by passing hints to the compiler. In this paper, we investigate the implementation aspect of cache directive under NVIDIA-like GPUs and propose optimizations for the CUDA backend. We use CUDA’s shared memory as the software-managed cache space. We first show that a straightforward implementation can be very inefficient, and downgrade performance. We investigate the differences between this implementation and hand-written CUDA alternatives and introduce the following optimizations to bridge the performance gap between the two: i) improving occupancy by sharing the cache among several parallel threads and ii) optimizing cache fetch and write routines via parallelization and minimizing control flow. We present compiler passes to apply these optimizations. Investigating three test cases, we show that the best cache directive implementation can perform very close to hand-written CUDA equivalent and improve performance up to 2.18X (compared to the baseline OpenACC.)

Tags: Code generation, Computer science, CUDA, nVidia, OpenACC, OpenCL, Package, Performance, Tesla K20

December 3, 2016 by hgpu

No votes yet.

Please wait...

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

high performance computing on graphics processing units: hgpu.org

OpenACC cache Directive: Opportunities and Optimizations

Package:

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)

OpenACC cache Directive: Opportunities and Optimizations

Package:

Share this:

Recent source codes

Most viewed papers (last 30 days)