GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs

Mai Zheng, Vignesh T. Ravi, Wenjing Ma, Feng Qin, Gagan Agrawal

Dept. of Computer Science and Engineering, The Ohio State University

The Ohio State University, Electronic report OSU-CISRC-5/12-TR11, 2012

Driven by their cost-effectiveness and power efficiency, GPUs are increasingly used to accelerate computations in many domains. However, developing highly efficient GPU implementations requires a lot of expertise and effort. Tool support for tuning GPU programs is therefore urgently needed, and more specifically, low-overhead mechanisms for collecting fine-grained runtime information are critically required. Unfortunately, the profiling tools and mechanisms available today either collect very coarse-grained information or have prohibitive overheads. This paper presents a low-overhead, fine-grained profiling technique developed specifically for GPUs, which we refer to as GMProf. GMProf uses two ideas to reduce the overhead of collecting fine-grained information. The first is to exploit a number of GPU architectural features to collect reasonably accurate information very efficiently, and the second is to use simple static analysis methods to reduce the overhead of runtime profiling. The specific implementation of GMProf we report in this paper focuses on shared memory usage. In particular, we help programmers understand (1) which locations in shared memory are infrequently accessed and (2) which data elements in device memory are frequently accessed. We have evaluated GMProf using six popular GPU kernels with different characteristics. Our experimental results show that GMProf, with all optimizations, incurs a moderate overhead, e.g., 1.36 times on average for shared memory profiling. Furthermore, for three of the six evaluated kernels, GMProf verified that shared memory is effectively used, and for the remaining three kernels, it not only helped accurately identify the inefficient use of shared memory, but also helped tune the implementations. The resulting tuned implementations had a speedup of 15.18 times on average.
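
The two questions in the abstract (which shared memory locations are rarely accessed, and which device memory elements are accessed often) can be illustrated with a toy access-count sketch in Python. This is not GMProf's mechanism, which relies on GPU architectural features and static analysis that the abstract does not detail. The function name `profile_accesses`, the `(space, index)` trace format, and both thresholds are assumptions made only for illustration.

```python
from collections import Counter


def profile_accesses(trace, shared_size, cold_threshold=1, hot_threshold=3):
    """Count accesses per location in a hypothetical memory trace.

    trace: iterable of (space, index) pairs, where space is "shared" or "device".
    Returns (cold_shared, hot_device):
      cold_shared: shared-memory indices accessed at most cold_threshold times
      hot_device:  device-memory indices accessed at least hot_threshold times
    """
    shared_counts = Counter()
    device_counts = Counter()
    for space, index in trace:
        if space == "shared":
            shared_counts[index] += 1
        elif space == "device":
            device_counts[index] += 1
        else:
            raise ValueError(f"unknown memory space: {space!r}")

    # Shared memory has a fixed size, so never-touched slots count as cold too.
    cold_shared = [i for i in range(shared_size) if shared_counts[i] <= cold_threshold]
    # Device memory is only reported for elements that actually appear in the trace.
    hot_device = sorted(i for i, n in device_counts.items() if n >= hot_threshold)
    return cold_shared, hot_device


# Illustrative trace (made-up data, not taken from the paper).
trace = [
    ("shared", 0), ("shared", 0), ("shared", 1),
    ("device", 7), ("device", 7), ("device", 7), ("device", 9),
]
cold, hot = profile_accesses(trace, shared_size=4)
print("rarely used shared slots:", cold)
print("frequently used device elements:", hot)
```

In this sketch, shared slots that show up as cold are candidates for freeing space, and device elements that show up as hot are candidates for being staged in shared memory, which is the kind of tuning decision the abstract describes.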

Tags: Computer science, CUDA, nVidia, Optimization, Performance, Tesla C1060

May 23, 2012 by hgpu
