OpenACC cache Directive: Opportunities and Optimizations
Electrical and Computer Engineering Department, University of Victoria
Third Workshop on Accelerator Programming Using Directives (WACCPD), 2016
@inproceedings{lashgar2016openacc,
title={OpenACC cache directive: opportunities and optimizations},
author={Lashgar, Ahmad and Baniasadi, Amirali},
booktitle={Proceedings of the Third International Workshop on Accelerator Programming Using Directives},
pages={46--56},
year={2016},
organization={IEEE Press}
}
OpenACC’s programming model presents a simple interface to programmers, offering a trade-off between performance and development effort. OpenACC relies on compiler technology to generate efficient code and optimize for performance. Among the directives that are difficult to implement is the cache directive. The cache directive allows the programmer to utilize the accelerator’s hardware- or software-managed caches by passing hints to the compiler. In this paper, we investigate the implementation of the cache directive on NVIDIA-like GPUs and propose optimizations for the CUDA backend, using CUDA’s shared memory as the software-managed cache space. We first show that a straightforward implementation can be very inefficient and degrade performance. We investigate the differences between this implementation and hand-written CUDA alternatives and introduce two optimizations to bridge the performance gap between the two: i) improving occupancy by sharing the cache among several parallel threads and ii) optimizing the cache fetch and write routines via parallelization and minimized control flow. We present compiler passes to apply these optimizations. Investigating three test cases, we show that the best cache directive implementation can perform very close to hand-written CUDA equivalents and improve performance by up to 2.18X (compared to the OpenACC baseline).
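As a minimal illustration of the directive discussed in the abstract (not an example taken from the paper), the sketch below marks a small window of an array as cacheable inside a 1-D stencil loop; an OpenACC compiler with a CUDA backend may then stage that window in shared memory. The function name, array names, and coefficients are hypothetical.

#include <stdlib.h>

/* 1-D three-point stencil: b[i] is computed from a[i-1..i+1].        */
/* The cache directive hints that the a[i-1:3] window is reused and   */
/* worth keeping in fast on-chip (e.g. CUDA shared) memory.           */
void stencil(const float *restrict a, float *restrict b, int n)
{
    #pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n])
    for (int i = 1; i < n - 1; i++) {
        #pragma acc cache(a[i-1:3])
        b[i] = 0.25f * a[i-1] + 0.5f * a[i] + 0.25f * a[i+1];
    }
}

Under the optimizations described in the paper, the generated code for such a loop would share the staged window among neighbouring threads to preserve occupancy and would parallelize the fetch into shared memory rather than filling it serially per thread.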
December 3, 2016 by hgpu