Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes

Weile Wei

Louisiana State University

Louisiana State University, 2022

@article{wei2022optimizing,
  title={Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes},
  author={Wei, Weile},
  year={2022}
}

Source code package: DCA++: a state of the art implementation of the dynamical cluster approximation

High-performance computing (HPC) scientific applications often face performance, scalability, portability, and efficiency challenges when running on heterogeneous supercomputers. For years, supercomputer architectures have been changing rapidly and becoming more complex, and these challenges will grow as we enter the exascale era, in which computers will exceed one quintillion calculations per second. Software adaptation and optimization are needed to address them. Asynchronous many-task (AMT) systems show promise for the exascale challenge because they combine the advantages of multi-core architectures with lightweight threads, asynchronous execution, smart scheduling, and portability across diverse architectures. In this research, we optimize the performance of a highly scalable scientific application using HPX, an AMT runtime system, and address its performance bottlenecks on supercomputers. We use DCA++ (Dynamical Cluster Approximation) as a research vehicle for studying performance bottlenecks in parallel and concurrent applications. DCA++ is a high-performance research software application that provides a modern C++ implementation for solving quantum many-body problems with a Quantum Monte Carlo (QMC) kernel. QMC solver applications are widely used and are mission-critical across the US Department of Energy's (DOE's) application landscape. Throughout the research, we implement several optimization techniques. First, we add HPX threading backend support to DCA++ and achieve a significant speedup. Second, we address a memory-bound challenge in DCA++ by developing ring-based communication algorithms that use GPU RDMA technology and allow much larger scientific simulation cases. Third, we explore a methodology for using LLVM-based tools to tune DCA++ for the new ARM A64FX processor. We profile all implementations in depth and observe significant performance improvements in each of them.
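
To give a rough idea of what a ring-based communication pattern looks like, the following is a minimal single-process sketch in C++. It is not taken from DCA++ and does not use HPX, MPI, or GPU RDMA; the function name `ring_accumulate` and the choice of a simple sum are assumptions made purely for illustration. Each simulated rank repeatedly forwards the value it last received to its right neighbour, so after n-1 steps every rank has seen every other rank's contribution without any rank holding all the data at once.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of DCA++): simulates n ranks arranged in a
// ring inside one process. Each rank starts with one local value; after
// n-1 forwarding steps every rank has accumulated the values of all ranks.
std::vector<double> ring_accumulate(const std::vector<double>& local) {
    const std::size_t n = local.size();
    std::vector<double> total = local;      // running sum held by each rank
    std::vector<double> in_flight = local;  // value each rank sends next

    for (std::size_t step = 0; step + 1 < n; ++step) {
        std::vector<double> received(n);
        // Every rank sends its in-flight value to its right neighbour.
        for (std::size_t rank = 0; rank < n; ++rank) {
            const std::size_t right = (rank + 1) % n;
            received[right] = in_flight[rank];
        }
        // Every rank accumulates what it just received.
        for (std::size_t rank = 0; rank < n; ++rank) {
            total[rank] += received[rank];
        }
        // In the next step, each rank forwards the value it just received.
        in_flight = received;
    }
    return total;
}
```

In a real distributed setting the inner "send to the right neighbour" loop would be replaced by point-to-point transfers between processes or devices; the appeal of the pattern is that each participant only ever needs buffer space for its own data plus one in-flight message.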

Tags: Algorithms, Computer science, CUDA, Heterogeneous systems, nVidia, Optimization, Package, QMC, Tesla V100, Thesis

July 3, 2022 by hgpu
