
Loop Perforation in OpenACC

Ahmad Lashgar, Ehsan Atoofian, Amirali Baniasadi
Electrical and Computer Engineering Department, University of Victoria, Victoria, BC, Canada
16th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), 2018

@inproceedings{lashgar2018loop,
   title={Loop Perforation in OpenACC},
   author={Lashgar, Ahmad and Atoofian, Ehsan and Baniasadi, Amirali},
   booktitle={2018 IEEE Intl Conf on Parallel \& Distributed Processing with Applications, Ubiquitous Computing \& Communications, Big Data \& Cloud Computing, Social Computing \& Networking, Sustainable Computing \& Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom)},
   pages={163--170},
   year={2018},
   organization={IEEE}
}

High-level programming models such as OpenMP and OpenACC are used to accelerate loop-parallelizable applications. In such applications, a very large number of loop iterations are launched as threads on the accelerator, where every iteration executes the same code sequence (loop body or kernel) but on different data. In such workloads, similar inputs tend to produce similar outputs. Motivated by this observation, we propose to run only a subset of loop iterations, accurately calculating some outputs and approximating the rest. To this end, we propose a new OpenACC directive that trades accuracy for performance by perforating loop iterations. The directive applies only to parallel loops, and the programmer sets the perforation rate. We also investigate the quality and runtime impact of this directive. In summary, first, we show that naïvely applying loop perforation to OpenACC degrades output quality significantly. This is because OpenACC parallel loops are often output-parallelized: every iteration calculates a single entry in the output, so dropping k iterations leads to k erroneous entries in the output. Second, to address this, we propose an efficient, low-overhead mechanism to recover the values of these missing output entries. Third, we show that, due to the SIMD organization of accelerators, perforation does not always translate to runtime improvements. Our study shows that perforation can change memory coalescing behavior and negatively impact runtime. To provide better insight, we present the workload characteristics that benefit most from perforation. Our evaluations using a diverse set of benchmarks indicate that the proposed technique can improve performance by up to 93% while keeping quality loss below 10%.
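
The idea in the abstract can be illustrated with a small sequential sketch. It is not the authors' directive or their recovery mechanism, neither of which the abstract specifies. It assumes a hypothetical per-element `kernel`, a fixed `stride` standing in for the perforation rate, and a simple "copy the nearest computed neighbor" rule standing in for recovery of the skipped output entries.

```python
def kernel(x):
    # Hypothetical loop body: any pure per-element function would do.
    return x * x


def exact(inputs):
    # Reference: every iteration runs and writes its own output entry.
    return [kernel(x) for x in inputs]


def perforated(inputs, stride=2):
    # Assumption: keep every `stride`-th iteration and drop the rest.
    n = len(inputs)
    out = [None] * n
    for i in range(0, n, stride):
        out[i] = kernel(inputs[i])
    # Assumed recovery rule: a skipped entry copies the closest earlier
    # entry that was actually computed (index rounded down to the stride).
    for i in range(n):
        if out[i] is None:
            out[i] = out[i - i % stride]
    return out


def mean_relative_error(reference, approx):
    # Simple quality metric; assumes no reference entry is zero.
    total = sum(abs(r - a) / abs(r) for r, a in zip(reference, approx))
    return total / len(reference)
```

With `stride=1` nothing is skipped and `perforated` matches `exact`. Larger strides skip more work, and the error stays small only when neighboring inputs are similar, which is the property the abstract relies on. The sketch says nothing about SIMD execution or memory coalescing, which the abstract identifies as the reason skipped iterations do not always save time on an accelerator.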
