Loop Perforation in OpenACC

Ahmad Lashgar, Ehsan Atoofian, Amirali Baniasadi

Electrical and Computer Engineering Department, University of Victoria, Victoria, BC, Canada

16th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), 2018

Package: IPMACC: a framework for translating OpenACC for C API to CUDA, OpenCL, and Intel ISPC

High-level programming models such as OpenMP and OpenACC are used to accelerate loop-parallelizable applications. In such applications, a very large number of loop iterations are launched as threads on the accelerator, where every iteration executes the same code sequence (loop body or kernel) but on different data. In such workloads, similarities in the inputs lead to broad similarities in the outputs. Motivated by this observation, we propose to run only a subset of loop iterations, accurately calculating some outputs and approximating the rest. To this end, we propose a new OpenACC directive that trades accuracy for performance by perforating loop iterations. The directive applies only to parallel loops, and the programmer adjusts the perforation rate. We also investigate the impact of this directive on output quality and runtime. In summary, first, we show that naïvely applying loop perforation to OpenACC degrades output quality significantly. This is because OpenACC parallel loops are often output-parallelized, with every iteration calculating a single entry in the output. Consequently, dropping k iterations leads to k erroneous entries in the output. Second, to address this, we propose an efficient low-overhead mechanism to recover the values of these missing output entries. Third, we show that, due to the SIMD organization of accelerators, perforation does not always translate to runtime improvements. Our study shows that perforation can change memory coalescing behavior and negatively impact runtime. To provide better insight, we present the characteristics of workloads that benefit most from perforation. Our evaluations using a diverse set of benchmarks indicate that the proposed technique can improve performance by up to 93% while keeping the quality loss below 10%.
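The idea in the abstract can be illustrated with a minimal C sketch using only standard OpenACC directives. The abstract names neither the proposed directive nor its recovery mechanism, so everything here is an assumption made for illustration: the loop body `kernel_body`, the function names, the `stride` parameter standing in for the perforation rate, and the recovery rule (copy the nearest preceding computed entry) are all hypothetical and are not the authors' implementation.

```c
#include <stddef.h>

/* Hypothetical loop body: each iteration produces exactly one output
   entry, i.e. the loop is output-parallelized. */
#pragma acc routine seq
static float kernel_body(const float *in, int i) {
    return 2.0f * in[i] + 1.0f;
}

/* Exact version: every iteration runs. */
void compute_exact(const float *in, float *out, int n) {
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        out[i] = kernel_body(in, i);
    }
}

/* Perforated version (illustrative assumption, not the paper's directive):
   only iterations with i % stride == 0 execute the loop body. A second,
   cheap pass fills each skipped entry with the nearest preceding computed
   entry, so that no output entry is left undefined. */
void compute_perforated(const float *in, float *out, int n, int stride) {
    if (stride < 1) {
        stride = 1; /* stride 1 means no perforation */
    }
    int computed = (n + stride - 1) / stride; /* iterations actually run */

    #pragma acc data copyin(in[0:n]) copyout(out[0:n])
    {
        /* Phase 1: run the expensive body on a subset of iterations. */
        #pragma acc parallel loop present(in[0:n], out[0:n])
        for (int k = 0; k < computed; ++k) {
            int i = k * stride;
            out[i] = kernel_body(in, i);
        }

        /* Phase 2: recover the skipped entries. Only skipped entries are
           written and only computed entries are read, so iterations are
           independent. */
        #pragma acc parallel loop present(out[0:n])
        for (int i = 0; i < n; ++i) {
            int r = i % stride;
            if (r != 0) {
                out[i] = out[i - r];
            }
        }
    }
}

/* Mean absolute error between two arrays, as a simple quality metric. */
float mean_abs_error(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        sum += (d < 0.0f) ? -d : d;
    }
    return (n > 0) ? sum / (float)n : 0.0f;
}
```

Without phase 2, the skipped entries of `out` would hold no meaningful values, which corresponds to the "k erroneous entries" problem described above. Note also that phase 1 touches `in` and `out` at a stride, so adjacent threads no longer access adjacent elements; this is one way perforation can alter memory coalescing. A compiler without OpenACC support ignores the pragmas and runs the same code sequentially.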

Tags: Computer science, CUDA, OpenACC, OpenCL, Package, Performance

April 20, 2019 by hgpu
