Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework
Technische Universität Dresden, Fakultät Informatik
Technische Universität Dresden, 2019
@phdthesis{dietrich2019scalable,
  title={Scalable Applications on Heterogeneous System Architectures},
  author={Dietrich, Robert},
  school={Technische Universit{\"a}t Dresden},
  year={2019}
}
The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. It is a recurring task, as HPC systems and their software stacks are regularly replaced and applications have to be ported to the new execution environment. This thesis presents a framework for the systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g., due to load imbalances between parallel execution streams. The building blocks of the analysis are patterns of inefficient execution in process-level and thread-level parallelization as well as in computation offloading. The latter is, compared to the other two, a relatively new programming model in HPC.

As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable portable data acquisition and subsequent analysis for programs with offload directives. The specified runtime events also enable the tracking of dependencies between tasks, and thus between program regions on the host and offloaded tasks. These interfaces are now part of the latest OpenACC and OpenMP API specifications. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process that can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root causes as well as the critical-path share of each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and on the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace enhanced with information about wait states, their causes, and the critical path. In addition, a ranking highlights program regions that are relevant for program optimization. The ranking criterion is the amount of waiting time a program region caused on the critical path.

The thesis concludes with a description of the performance analysis framework, its implementation, and its application. Scalability is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated to show the applicability of the analysis.
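The OpenMP part of this portable data acquisition is the OpenMP tool interface (OMPT), which became part of OpenMP 5.0. As a rough illustration of how a measurement tool attaches to the runtime and receives events for target (offload) regions, consider the following minimal sketch; the names on_target, tool_initialize, and tool_finalize are illustrative, and a real tool would record trace events instead of printing.

#include <stdio.h>
#include <omp-tools.h>

/* Callback for target (offload) regions; prints begin/end events. */
static void on_target(ompt_target_t kind, ompt_scope_endpoint_t endpoint,
                      int device_num, ompt_data_t *task_data,
                      ompt_id_t target_id, const void *codeptr_ra)
{
    printf("target region %s on device %d (id %llu)\n",
           endpoint == ompt_scope_begin ? "begin" : "end",
           device_num, (unsigned long long)target_id);
}

/* Called by the runtime; registers the callbacks the tool is interested in. */
static int tool_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                           ompt_data_t *tool_data)
{
    ompt_set_callback_t set_callback =
        (ompt_set_callback_t)lookup("ompt_set_callback");
    set_callback(ompt_callback_target, (ompt_callback_t)&on_target);
    return 1;  /* non-zero keeps the tool active */
}

static void tool_finalize(ompt_data_t *tool_data) { }

/* Entry point the OpenMP runtime looks for at startup. */
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version)
{
    static ompt_start_tool_result_t result = {
        &tool_initialize, &tool_finalize, {0}
    };
    return &result;
}

With an OpenMP 5.0 compliant runtime, such a tool is typically built as a shared library and activated through the OMP_TOOL_LIBRARIES environment variable; the OpenACC profiling interface follows a similar callback-registration scheme.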
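The ranking criterion described above, the amount of waiting time a region caused on the critical path, can be pictured with a small, self-contained sketch. The region_stats structure and the numbers below are hypothetical and only illustrate how per-region results from the trace analysis would be ordered to guide optimization.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-region summary as it could come out of the trace analysis. */
typedef struct {
    const char *region;   /* program region (function or directive) */
    double wait_on_cp;    /* waiting time it caused on the critical path, in seconds */
    double cp_share;      /* fraction of the critical path spent in the region */
} region_stats;

/* Sort descending by waiting time caused on the critical path. */
static int by_wait_on_cp(const void *a, const void *b)
{
    double da = ((const region_stats *)a)->wait_on_cp;
    double db = ((const region_stats *)b)->wait_on_cp;
    return (da < db) - (da > db);
}

int main(void)
{
    region_stats stats[] = {   /* illustrative numbers only */
        { "MPI_Waitall",      1.8, 0.05 },
        { "omp target teams", 0.9, 0.40 },
        { "compute_kernel",   0.2, 0.35 },
    };
    size_t n = sizeof stats / sizeof stats[0];

    qsort(stats, n, sizeof stats[0], by_wait_on_cp);

    puts("rank  region              wait-on-CP[s]  CP-share");
    for (size_t i = 0; i < n; ++i)
        printf("%4zu  %-18s  %13.2f  %8.2f\n",
               i + 1, stats[i].region, stats[i].wait_on_cp, stats[i].cp_share);
    return 0;
}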
December 1, 2019 by hgpu