Autotuning OpenACC Work Distribution via Direct Search
Department of Computer Science and Software Engineering, Auburn University, AL, USA
ACM Conference on the Extreme Science and Engineering Discovery Environment (XSEDE15), 2015
@article{montgomery2015autotuning,
title={Autotuning OpenACC Work Distribution via Direct Search},
author={Montgomery, Calvin and Overbey, Jeffrey L and Li, Xuechao},
year={2015}
}
OpenACC provides a high-productivity API for programming GPUs and similar accelerator devices. One of the last steps in tuning OpenACC programs is selecting values for the num_gangs and vector_length clauses, which control how a parallel workload is distributed to an accelerator’s processing units. In this paper, we present OptACC, an autotuner that can assist the programmer in selecting high-quality values for these parameters, and we evaluate the effectiveness of two direct search methods in finding solutions. We assess the quality of the num_gangs and vector_length values found by our autotuner by comparing them to the values found by a bounded exhaustive search; we also compare the kernel execution times to those of the untuned kernel. On a suite of 36 OpenACC kernels, one or both of our autotuner’s direct search methods identified values within the top 5% for 29 of the kernels, within the top 10% for five kernels, and within the top 25% for the remaining two. Eleven of the kernels achieved a speedup greater than 2x over the compiler’s defaults, and the autotuner required only 7-11 runs of the target program, on average.
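To illustrate the general idea of tuning these two parameters by direct search, here is a minimal Python sketch of a compass (coordinate) search over (num_gangs, vector_length). It is not OptACC's implementation: the abstract does not name the two direct search methods that were evaluated, so compass search is swapped in as a simple, well-known member of that family. The function names, starting point, step sizes, bounds, and the synthetic timing function are all assumptions made for illustration; a real `measure` would build and run the target kernel with the given clause values and return its execution time.

```python
from typing import Callable, Dict, Tuple

Point = Tuple[int, int]  # (num_gangs, vector_length)


def compass_search(
    measure: Callable[[int, int], float],
    start: Point,
    step: Point,
    lower: Point,
    upper: Point,
) -> Tuple[Point, float, int]:
    """Minimize measure(num_gangs, vector_length) with a compass search.

    Returns (best point, best time, number of distinct measurements).
    """
    cache: Dict[Point, float] = {}  # never time the same configuration twice

    def timed(p: Point) -> float:
        if p not in cache:
            cache[p] = measure(*p)
        return cache[p]

    def clamp(p: Point) -> Point:
        # Keep both parameters inside the allowed search box.
        return (
            min(max(p[0], lower[0]), upper[0]),
            min(max(p[1], lower[1]), upper[1]),
        )

    best = clamp(start)
    best_time = timed(best)
    step_g, step_v = step

    while step_g > 0 or step_v > 0:
        improved = False
        # Probe one step in each direction along each axis.
        for dg, dv in ((step_g, 0), (-step_g, 0), (0, step_v), (0, -step_v)):
            cand = clamp((best[0] + dg, best[1] + dv))
            t = timed(cand)
            if t < best_time:
                best, best_time, improved = cand, t, True
        if not improved:
            # No neighbor was faster: shrink the pattern and try again.
            step_g //= 2
            step_v //= 2

    return best, best_time, len(cache)


def synthetic_time(num_gangs: int, vector_length: int) -> float:
    # Hypothetical stand-in for a real kernel timing, fastest at (256, 128).
    return 1.0 + abs(num_gangs - 256) / 256 + abs(vector_length - 128) / 128


best, best_time, runs = compass_search(
    synthetic_time,
    start=(64, 32),
    step=(128, 64),
    lower=(1, 1),
    upper=(1024, 1024),
)
print(best, best_time, runs)
```

The cache matters in this setting because each measurement stands for a full run of the target program, which is the cost the abstract reports. Real timing data is noisy and need not be as smooth as the synthetic function, so a search like this can stop at a local minimum; the sketch only conveys the shape of the approach.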
July 8, 2015 by hgpu