high performance computing on graphics processing units: hgpu.org

Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

Robert Dietrich

Technischen Universität Dresden, Fakultät Informatik

Technischen Universität Dresden, 2019

The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. It is a recurring task, as HPC systems and their software stacks are regularly replaced and applications have to be ported to the new execution environment. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. The building blocks of the analysis are patterns of inefficient execution in the parallelization at process and thread level and in computation offloading. The latter is, compared to the other two, a relatively new programming model in HPC. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable portable data acquisition and subsequent analysis for programs with offload directives. The specified runtime events also enable the tracking of dependencies between tasks and thus between program regions on the host and offloaded tasks. These interfaces are already part of the latest OpenACC and OpenMP API specifications. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models.
Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share for each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace, enhanced with information about wait states, their cause, and the critical path. In addition, a ranking highlights program regions that are relevant for program optimization. The ranking criterion is the amount of waiting time a program region caused on the critical path. The thesis concludes with a description of the performance analysis framework, its implementation, and application. The scalability is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis.
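The idea of attributing waiting time to the program region that caused it can be illustrated with a toy example. The following Python sketch is not the thesis's framework or algorithm. It assumes a hypothetical, heavily simplified trace in which execution streams only synchronize at barriers, and it blames all waiting time at a barrier on the region executed by the last-arriving stream. The trace layout, the region names, and the function `rank_regions` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy trace: for each barrier, map every execution stream to the
# region it ran before the barrier and its arrival time in seconds.
barriers = [
    {"s0": ("solve", 4.0), "s1": ("solve", 6.0), "s2": ("io", 5.0)},
    {"s0": ("exchange", 9.0), "s1": ("solve", 8.0), "s2": ("solve", 8.5)},
]


def rank_regions(barriers):
    """Rank regions by the total waiting time they caused on other streams.

    Simplifying assumption: at each barrier, the last-arriving stream delays
    all others, so its region is treated as the cause of their waiting time.
    """
    blame = defaultdict(float)
    for arrivals in barriers:
        # The stream that arrives last determines when the barrier completes.
        last_stream = max(arrivals, key=lambda s: arrivals[s][1])
        cause_region, last_time = arrivals[last_stream]
        for _region, arrival_time in arrivals.values():
            # Each stream waits from its own arrival until the last arrival.
            blame[cause_region] += last_time - arrival_time
    # Regions that caused the most waiting time come first.
    return sorted(blame.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for region, caused_wait in rank_regions(barriers):
        print(f"{region}: caused {caused_wait:.1f} s of waiting time")
```

In this toy trace, `solve` is ranked first because two streams wait for it at the first barrier. A real analysis would additionally have to handle point-to-point communication, offloaded tasks, and wait states that propagate across levels of parallelism, none of which this sketch models.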

Tags: Computer science, CUDA, Data acquisition, Heterogeneous systems, MPI, nVidia, OpenACC, OpenCL, Tesla K80, Tesla V100, Thesis

December 1, 2019 by hgpu
