high performance computing on graphics processing units: hgpu.org

GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs

Mai Zheng, Vignesh T. Ravi, Wenjing Ma, Feng Qin, Gagan Agrawal

Dept. of Computer Science and Engineering, The Ohio State University

The Ohio State University, Electronic report OSU-CISRC-5/12-TR11, 2012

@article{zheng2012gmprof,
  title={GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs},
  author={Zheng, M. and Ravi, V.T. and Ma, W. and Qin, F. and Agrawal, G.},
  year={2012}
}

Driven by their cost-effectiveness and power efficiency, GPUs are increasingly used to accelerate computations in many domains. However, developing highly efficient GPU implementations requires considerable expertise and effort. Tool support for tuning GPU programs is therefore urgently needed, and in particular, low-overhead mechanisms for collecting fine-grained runtime information are critically required. Unfortunately, the profiling tools and mechanisms available today either collect very coarse-grained information or incur prohibitive overheads. This paper presents a low-overhead, fine-grained profiling technique developed specifically for GPUs, which we refer to as GMProf. GMProf uses two ideas to reduce the overhead of collecting fine-grained information. The first is to exploit a number of GPU architectural features to collect reasonably accurate information very efficiently; the second is to use simple static analysis methods to reduce the overhead of runtime profiling. The specific implementation of GMProf reported in this paper focuses on shared memory usage. In particular, we help programmers understand (1) which locations in shared memory are infrequently accessed, and (2) which data elements in device memory are frequently accessed. We have evaluated GMProf using six popular GPU kernels with different characteristics. Our experimental results show that GMProf, with all optimizations, incurs a moderate overhead, e.g., 1.36 times on average for shared memory profiling. Furthermore, for three of the six evaluated kernels, GMProf verified that shared memory is used effectively; for the remaining three kernels, it not only helped accurately identify the inefficient use of shared memory, but also helped tune the implementations. The resulting tuned implementations had a speedup of 15.18 times on average.
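The two questions posed in the abstract (which shared memory locations are rarely accessed, and which device memory elements are accessed often) can be illustrated with a small host-side sketch. This is not GMProf's implementation, which the abstract describes as relying on GPU architectural features and static analysis; it only shows the kind of per-location access counting such a profiler reports. The trace format, the function names, and the thresholds are all assumptions made for illustration.

```python
from collections import Counter

def profile_accesses(trace):
    """Count accesses per (memory space, index) pair.

    `trace` is a hypothetical list of (space, index) events, where
    space is "shared" or "device". A real profiler would gather
    these counts on the GPU rather than from a host-side list.
    """
    counts = Counter()
    for space, index in trace:
        counts[(space, index)] += 1
    return counts

def infrequent_shared(counts, shared_size, threshold):
    """Shared-memory slots accessed fewer than `threshold` times,
    including slots that were never accessed at all."""
    return [i for i in range(shared_size) if counts[("shared", i)] < threshold]

def frequent_device(counts, threshold):
    """Device-memory elements accessed at least `threshold` times."""
    return sorted(
        index
        for (space, index), n in counts.items()
        if space == "device" and n >= threshold
    )

# Made-up access trace, purely for illustration.
trace = (
    [("shared", 0)] * 8    # shared slot 0 is heavily reused
    + [("shared", 1)]      # shared slot 1 is touched once
    + [("device", 5)] * 6  # device element 5 is read repeatedly
    + [("device", 9)]      # device element 9 is read once
)

counts = profile_accesses(trace)
cold = infrequent_shared(counts, shared_size=4, threshold=2)
hot = frequent_device(counts, threshold=4)
print("rarely used shared slots:", cold)
print("frequently used device elements:", hot)
```

In this toy trace, the rarely used shared slots are candidates for freeing up, and the frequently read device elements are candidates for staging into shared memory, which is the kind of tuning decision the abstract says the profile is meant to inform.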

Tags: Computer science, CUDA, nVidia, Optimization, Performance, Tesla C1060

May 23, 2012 by hgpu
