Low-Overhead Trace Collection and Profiling on GPU Compute Kernels
Polytechnique Montréal, Montréal, Canada
ACM Transactions on Parallel Computing, 2024
DOI:10.1145/3649510
@article{darche2024low,
  title={Low-Overhead Trace Collection and Profiling on GPU Compute Kernels},
  author={Darche, Sébastien and Dagenais, Michel R},
  journal={ACM Transactions on Parallel Computing},
  year={2024},
  publisher={ACM New York, NY}
}
While GPUs can bring substantial speedups to compute-intensive tasks, programming them is notoriously hard. From the programming model to microarchitectural particularities, the programmer may encounter many pitfalls that hinder performance in obscure ways. Numerous performance analysis tools provide helpful data on the efficiency of compute kernels, but few allow the programmer to efficiently gather runtime information directly on the device and pinpoint the sections to optimize. In this paper, we propose an instrumentation method that collects traces while the compute kernel executes, with reduced overhead compared to other approaches, by exploiting the inherently parallel behavior of GPUs and compartmentalizing tracing phases. The reference implementation is freely available and induces an average overhead of 1.6× on a popular scientific computing benchmark and 1.5× over the kernel execution time. This represents an improvement of an order of magnitude over similar work and proves useful for timing-guided optimizations. The tool generates insightful execution traces and timestamps, which can be analyzed to better understand performance issues in the kernel.
March 3, 2024 by hgpu