Refining HPCToolkit for application performance analysis at exascale
Department of Computer Science, Rice University, Houston, TX, USA
The International Journal of High Performance Computing Applications, 2024
@article{adhianto2024refining,
title={Refining HPCToolkit for application performance analysis at exascale},
author={Adhianto, Laksono and Anderson, Jonathon and Barnett, Robert Matthew and Grbic, Dragana and Indic, Vladimir and Krentel, Mark and Liu, Yumeng and Milakovi{\'c}, Sr{\dj}an and Phan, Wileam and Mellor-Crummey, John},
journal={The International Journal of High Performance Computing Applications},
pages={10943420241277839},
year={2024},
publisher={SAGE Publications Sage UK: London, England}
}
As part of the US Department of Energy’s Exascale Computing Project (ECP), Rice University has been refining its HPCToolkit performance tools to better support measurement and analysis of applications executing on exascale supercomputers. To efficiently collect performance measurements of GPU-accelerated applications, HPCToolkit employs novel non-blocking data structures to communicate performance measurements between tool threads and application threads. To attribute performance information in detail to source lines, loop nests, and inlined call chains, HPCToolkit performs parallel analysis of the large CPU and GPU binaries involved in the execution of an exascale application to rapidly recover mappings between machine instructions and source code. To analyze terabytes of performance measurements gathered during executions at exascale, HPCToolkit employs distributed-memory parallelism, multithreading, sparse data structures, and out-of-core streaming analysis algorithms. To support interactive exploration of profiles up to terabytes in size, HPCToolkit’s hpcviewer graphical user interface uses out-of-core methods to visualize performance data. The result of these efforts is that HPCToolkit now supports collection, analysis, and presentation of profiles and traces of GPU-accelerated applications at exascale. These improvements have enabled HPCToolkit to efficiently measure, analyze, and explore terabytes of performance data for executions using as many as 64K MPI ranks and 64K GPU tiles on ORNL’s Frontier supercomputer. HPCToolkit’s support for measurement and analysis of GPU-accelerated applications has been employed to study a collection of open-science applications developed as part of ECP. This paper reports on these experiences, which provided insight into opportunities for tuning applications, into strengths and weaknesses of HPCToolkit itself, and into unexpected behaviors in executions at exascale.
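To make the non-blocking hand-off described above concrete, the sketch below shows a minimal single-producer/single-consumer ring buffer in C++ that an application (or GPU monitor) thread could use to pass measurement records to a tool thread without locking. The type and member names (GpuMeasurement, MeasurementChannel, try_push, try_pop) and the memory-ordering choices are illustrative assumptions for this sketch, not HPCToolkit’s actual data structures or API.

```cpp
// Hypothetical sketch: lock-free SPSC ring buffer for handing GPU
// measurement records from an application thread to a tool thread.
#include <array>
#include <atomic>
#include <cstdint>
#include <optional>

struct GpuMeasurement {        // illustrative measurement record
  uint64_t correlation_id;     // links a GPU operation to its CPU call site
  uint64_t start_ns, end_ns;   // kernel start/end timestamps
};

template <size_t Capacity>
class MeasurementChannel {
  std::array<GpuMeasurement, Capacity> buf_;
  std::atomic<size_t> head_{0};  // next slot the producer will write
  std::atomic<size_t> tail_{0};  // next slot the consumer will read

public:
  // Producer side (application/monitor thread): never blocks.
  bool try_push(const GpuMeasurement& m) {
    size_t head = head_.load(std::memory_order_relaxed);
    size_t next = (head + 1) % Capacity;
    if (next == tail_.load(std::memory_order_acquire))
      return false;                      // full; caller may retry or drop
    buf_[head] = m;
    head_.store(next, std::memory_order_release);
    return true;
  }

  // Consumer side (tool thread): drains one record if available.
  std::optional<GpuMeasurement> try_pop() {
    size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire))
      return std::nullopt;               // empty
    GpuMeasurement m = buf_[tail];
    tail_.store((tail + 1) % Capacity, std::memory_order_release);
    return m;
  }
};
```

In such a design the application thread pays only an array write and two atomic operations per record, while the tool thread asynchronously drains, attributes, and persists the measurements, which is the kind of decoupling the paper describes between application threads and tool threads.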
September 15, 2024 by hgpu