## Parallel Sparse Linear Algebra for Multi-core and Many-core Platforms: Parallel Solvers and Preconditioners

Institut fur Angewandte und Numerische Mathematik

Institut fur Angewandte und Numerische Mathematik, 2012

@phdthesis{lukarski2012parallel,

title={Parallel Sparse Linear Algebra for Multi-core and Many-core Platforms: Parallel Solvers and Preconditioners},

author={Lukarski, Dimitar},

year={2012}

}

Partial differential equations are typically solved by means of finite difference, finite volume or finite element methods resulting in large, highly coupled, ill-conditioned and sparse (non-)linear systems. In order to minimize the computing time we want to exploit the capabilities of modern parallel architectures. The rapid hardware shifts from single core to multi-core and many-core processors lead to a gap in the progression of algorithms and programming environments for these platforms – the parallel models for large clusters do not fully utilize the performance capability of the multi-core CPUs and especially of the GPUs. Software stack needs to run adequately on the next generation of computing devices in order to exploit the potential of these new systems. Moving numerical software from one platform to another becomes an important task since every parallel device has its own programming model and language. The greatest challenge is to provide new techniques for solving (non-)linear systems that combine scalability, portability, fine-grained parallelism and flexibility across the assortment of parallel platforms and programming models. The goal of this thesis is to provide new fine-grained parallel algorithms embedded in advanced sparse linear algebra solvers and preconditioners on the emerging multi-core and many-core technologies. With respect to the mathematical methods, we focus on efficient iterative linear solvers. Here, we consider two types of solvers – out-of-the-box solvers such as preconditioned Krylov subspace solvers (e.g. CG, BiCGStab, GMRES), and problem-aware solvers such as geometric matrix-based multi-grid methods. Clearly, the majority of the solvers can be written in terms of sparse matrixvector and vector-vector operations which can be performed in parallel. Our aim is to provide parallel, generic and portable preconditioners which are suitable for multi-core and many-core devices. We focus on additive (e.g. Gauss-Seidel, SOR), multiplicative (ILU factorization with or without fill-ins) and approximate inverse preconditioners. The preconditioners can also be used as smoothing schemes in the multi-grid methods via a preconditioned defect correction step. We treat the additive splitting schemes by a multi-coloring technique to provide the necessary level of parallelism. For controlling the fill-in entries for the ILU factorization we propose a novel method which we call the power(q)-pattern method. We prove that this algorithm produces a new matrix structure with diagonal blocks containing only diagonal entries. This approach provides higher degrees of parallelism in comparison with the level-scheduling/topological sort algorithm. With these techniques we can perform the forward and backward substitution of the preconditioning step in parallel. By formulating the algorithm in block-matrix form we can execute the sweeps in parallel only by performing matrix-vector multiplications. Thus, we can express the data-parallelism in the sweeps without any specification of the underlying hardware or programming models. In object-oriented languages, an abstraction separates the object behavior from its implementation. Based on this abstraction, we have developed a linear algebra toolbox which supports several platforms such as multi-core CPUs, GPUs and accelerators. The various backends (sequential, OpenMP, CUDA, OpenCL) consist of optimized and platform-specific matrix and vector routines. Using unified interfaces across all platforms, the library allows users to build linear solvers and preconditioners without any information about the underlying hardware. With this technique, we can write our solvers and preconditioners in a single source code for all platforms. Furthermore, we can extend the library by adding new platforms without modifying the existing solvers and preconditioners. In our tests we consider two scenarios – preconditioned Krylov subspace methods and matrixbased multi-grid methods. We demonstrate speed ups in two directions: first, the preconditioners/smoothers reduce the total solution time by decreasing the number of iterations, and second, the preconditioning/smoothing phase is efficiently executed in parallel providing good scalability across several parallel architectures. We present numerical experiments and performance analysis on several platforms such as multi-core CPU and GPU devices. Furthermore, we show the viability and benefit of the proposed preconditioning schemes and software approach.

March 16, 2012 by hgpu