GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs
Dept. of Computer Science and Engineering, The Ohio State University
The Ohio State University, Electronic report OSU-CISRC-5/12-TR11, 2012
@techreport{zheng2012gmprof,
  title={GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs},
  author={Zheng, M. and Ravi, V.T. and Ma, W. and Qin, F. and Agrawal, G.},
  institution={The Ohio State University},
  number={OSU-CISRC-5/12-TR11},
  year={2012}
}
Driven by their cost-effectiveness and power efficiency, GPUs are being increasingly used to accelerate computations in many domains. However, developing highly efficient GPU implementations requires considerable expertise and effort. Thus, tool support for tuning GPU programs is urgently needed; more specifically, low-overhead mechanisms for collecting fine-grained runtime information are critically required. Unfortunately, the profiling tools and mechanisms available today either collect very coarse-grained information or incur prohibitive overheads. This paper presents a low-overhead, fine-grained profiling technique developed specifically for GPUs, which we refer to as GMProf. GMProf uses two key ideas to reduce the overhead of collecting fine-grained information. The first is to exploit a number of GPU architectural features to collect reasonably accurate information very efficiently; the second is to use simple static analysis methods to reduce the overhead of runtime profiling. The specific implementation of GMProf we report in this paper focuses on shared memory usage. In particular, we help programmers understand (1) which locations in shared memory are infrequently accessed and (2) which data elements in device memory are frequently accessed. We have evaluated GMProf using six popular GPU kernels with different characteristics. Our experimental results show that GMProf, with all optimizations, incurs a moderate overhead, e.g., 1.36 times on average for shared memory profiling. Furthermore, for three of the six evaluated kernels, GMProf verified that shared memory is used effectively, and for the remaining three kernels, it not only accurately identified inefficient uses of shared memory, but also helped tune the implementations. The resulting tuned implementations achieved a speedup of 15.18 times on average.
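For readers unfamiliar with fine-grained GPU profiling, the following minimal CUDA sketch (not taken from the paper; the kernel and all names are hypothetical) illustrates the naive instrumentation style whose cost GMProf aims to reduce: a per-location atomic counter is bumped on every shared-memory access, which is exactly the kind of overhead the paper's architectural and static-analysis optimizations target.

// Hypothetical sketch of naive fine-grained shared-memory profiling.
// Every shared-memory access is counted with an atomic increment;
// GMProf's contribution is reducing this instrumentation overhead.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256

// One counter per shared-memory slot (device globals are zero-initialized).
__device__ unsigned int sm_access_count[TILE];

__global__ void scaleWithProfiling(const float *in, float *out, float s, int n) {
    __shared__ float tile[TILE];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {  // n is a multiple of TILE, so whole blocks take this branch
        tile[threadIdx.x] = in[gid];                   // shared-memory write
        atomicAdd(&sm_access_count[threadIdx.x], 1u);  // profiling instrumentation
        __syncthreads();
        out[gid] = tile[threadIdx.x] * s;              // shared-memory read
        atomicAdd(&sm_access_count[threadIdx.x], 1u);  // profiling instrumentation
    }
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    scaleWithProfiling<<<n / TILE, TILE>>>(d_in, d_out, 2.0f, n);
    cudaDeviceSynchronize();

    // Read back the per-slot access counts accumulated on the device.
    unsigned int counts[TILE];
    cudaMemcpyFromSymbol(counts, sm_access_count, sizeof(counts));
    printf("shared-memory slot 0 accessed %u times\n", counts[0]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}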
May 23, 2012 by hgpu