26974

Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes

Weile Wei
Louisiana State University
Louisiana State University, 2022

@article{wei2022optimizing,

   title={Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes},

   author={Wei, Weile},

   year={2022}

}

Nowadays, High-performance Computing (HPC) scientific applications often face performance challenges when running on heterogeneous supercomputers, so do scalability, portability, and efficiency issues. For years, supercomputer architectures have been rapidly changing and becoming more complex, and this challenge will become even more complicated as we enter the exascale era, where computers will exceed one quintillion calculations per second. Software adaption and optimization are needed to address these challenges. Asynchronous many-task (AMT) systems show promise against the exascale challenge as they combine advantages of multi-core architectures with light-weight threads, asynchronous executions, smart scheduling, and portability across diverse architectures. In this research, we optimize the performance of a highly scalable scientific application using HPX, an AMT runtime system, and address its performance bottlenecks on super- computers. We use DCA++ (Dynamical Cluster Approximation) as a research vehicle for studying the performance bottlenecks in parallel and concurrent applications. DCA++ is a high-performance research software application that provides a modern C++ implementation to solve quantum many-body problems with a Quantum Monte Carlo (QMC) kernel. QMC solver applications are widely used and are mission-critical across the US Department of Energy’s (DOE’s) application landscape. Throughout the research, we implement several optimization techniques. Firstly, we add HPX threading backend support to DCA++ and achieve significant performance speedup. Secondly, we solve a memory-bound challenge in DCA++ and develop ring-based communication algorithms using GPU RDMA technology that allow much larger scientific simulation cases. Thirdly, we explore a methodology for using LLVM-based tools to tune the DCA++ that targets the new ARM A64Fx processor. We profile all implementations in-depth and observe significant performance improvement throughout all the implementations.
No votes yet.
Please wait...

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: