high performance computing on graphics processing units: hgpu.org

Low-Overhead Trace Collection and Profiling on GPU Compute Kernels

Sébastien Darche, Michel R. Dagenais

Polytechnique Montréal, Montréal, Canada

ACM Transactions on Parallel Computing, 2024

DOI:10.1145/3649510

Package: HIP-analyzer, a compiler plugin for performance analysis of HIP applications

While GPUs can bring substantial speedups to compute-intensive tasks, programming them is notoriously hard. From the programming model to microarchitectural particularities, the programmer may encounter many pitfalls that hinder performance in obscure ways. Numerous performance analysis tools provide helpful data on the efficiency of compute kernels, but few allow the programmer to efficiently gather runtime information directly on the device and pinpoint the sections to optimize. In this paper, we propose an instrumentation method that collects traces while the compute kernel executes, with reduced overhead compared to other approaches, by exploiting the inherently parallel behavior of GPUs and compartmentalizing the tracing phases. The reference implementation is freely available and induces an average overhead of 1.6× on a popular scientific computing benchmark and 1.5× over the kernel execution time. This represents an order-of-magnitude improvement over similar work and proves useful for timing-guided optimizations. The tool generates insightful execution traces and timestamps that can be analyzed to better understand performance issues in the kernel.

Tags: AMD Radeon Instinct MI100, ATI, Benchmarking, Computer science, CUDA, HIP, nVidia, Package, Performance, Profiling

March 3, 2024 by hgpu
