## Heterogeneous parallel algorithms for Computational Fluid Dynamics on unstructured meshes

Departament de Maquines i Motors Termics, E.T.S.E.I.A.T., Universitat Politecnica de Catalunya

Universitat Politecnica de Catalunya, 2015

@article{altamirano2015heterogeneous,

title={Heterogeneous parallel algorithms for Computational Fluid Dynamics on unstructured meshes},

author={Altamirano, Guillermo Andr{‘e}s Oyarz{‘u}n},

year={2015}

}

Frontiers of computational fluid dynamics (CFD) are constantly expanding and eagerly demanding more computational resources. Currently, we are experiencing an rapid evolution in the high performance computing systems driven by power consumption constraints. New HPC nodes incorporate accelerators that are used as math co-processors for increasing the throughput and the FLOP per watt ratio. On the other hand, multi-core CPUs have turned into energy efficient system-on-chip architectures. By doing so, the main components of the node are fused and integrated into a single chip reducing the energy costs. Nowadays, several institutions and governments are investing in the research and development of different aspects of HPC that could lead to the next generations of supercomputers. This initiatives have entitled the problem as the exascale challenge. This goal can only be achieved by incorporating major changes in computer architecture, memory design and network interfaces. The CFD community faces an important challenge: keep the pace at the rapid changes in the HPC resources. The codes and formulations need to be re-design in other to exploit the different levels of parallelism and complex memory hierarchies of the new heterogeneous systems. The main characteristics demanded to the new CFD software are: memory awareness, extreme concurrency, modularity and portability. This thesis is devoted to the study of a CFD algorithm re-factoring for the adoption of new technologies. Our application context is the solution of incompressible flows (DNS or LES) on unstructured meshes. The first approach was using GPUs for accelerating the Poisson solver, that is the most computational intensive part of our application. The positive results obtained in this first step motivated us to port the complete time integration phase of our application. This requires a major redesign of the code. We propose a portable implementation model for CFD applications. The main idea was substituting stencil data structures and kernels by algebraic storage formats and operators. By doing so, the algorithm was restructured into a minimal set of algebraic operations. The implementation strategy consisted in the creation of a low-level algebraic layer for computations on CPUs and GPUs, and a high-level user-friendly discretization layer for CPUs that is fully localized at the preprocessing stage where performance does not play an important role. As a result, at the time-integration phase the code relies only on three algebraic kernels: sparse-matrix-vector product (SpMV), linear combination of two vectors (AXPY) and dot product (DOT). Such a simple set of basic linear algebra operations naturally provides the desired portability to any computing architecture. Special attention was paid at the development of data structures compatibles with the stream processing model. A detailed performance analysis was studied in both sequential and parallel execution engaging up to 128 GPUs in a hybrid CPU/GPU supercomputer. Moreover, we tested the portable implementation model of TermoFluids code in the Mont-Blanc mobile-based supercomputer. The re-design of the kernels exploits a heterogeneous execution model using both computing devices CPU and GPU of the ARM-based nodes. The load balancing between the two computing devices exploits a tabu search strategy that tunes the workload distribution during the preprocessing stage. A comparison of the Mont-Blanc prototypes with high-end supercomputers in terms of the achieved net performance and energy consumption provided some guidelines of the behavior of CFD applications in ARM-based architectures. Finally, we present a memory aware auto-tuned Poisson solver for problems with one Fourier diagonalizable direction. This work was developed and tested in the BlueGene/Q Vesta supercomputer, and aims at demonstrating the relevance of vectorization and memory awareness for fully exploiting the modern energy efficient CPUs.

March 5, 2016 by hgpu