Guided Profiling for Auto-Tuning Array Layouts on GPUs
TU Darmstadt
PMBS 2015
@inproceedings{Weber:2015:GPA:2832087.2832093,
author={Weber, Nicolas and Amend, Sandra C. and Goesele, Michael},
title={Guided Profiling for Auto-tuning Array Layouts on GPUs},
booktitle={Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems},
series={PMBS '15},
year={2015},
isbn={978-1-4503-4009-0},
location={Austin, Texas},
pages={9:1--9:11},
articleno={9},
numpages={11},
url={http://doi.acm.org/10.1145/2832087.2832093},
doi={10.1145/2832087.2832093},
acmid={2832093},
publisher={ACM},
address={New York, NY, USA}
}
Auto-tuning for Graphics Processing Units (GPUs) has become very popular in recent years. It removes the need to hand-tune GPU code, especially when a new hardware architecture is released. Our auto-tuner optimizes memory access patterns, a key aspect of exploiting the full performance of modern GPUs. Because the memory hierarchy has changed in nearly every GPU generation, code has had to be reoptimized for each new architecture. Unfortunately, the solution space for memory optimizations in large applications can easily reach millions of configurations for a single kernel, far too many implementations to evaluate fully in a feasible time. In this paper we present an adaptive profiling algorithm that aims to find a near-optimal configuration, with performance within a small fraction of the global optimum, while reducing the profiling time by several orders of magnitude compared to an exhaustive search. Our algorithm is aimed at and evaluated on large real-world applications.
February 9, 2016 by mergian