## Irregularity Mitigation and Portability Abstractions for Accelerated Sparse Matrix Factorization

Darmstadt, Technische Universität

Darmstadt, Technische Universität, 2021

@article{thurck2021irregularity,

title={Irregularity mitigation and portability abstractions for accelerated sparse matrix factorization},

author={Th{"u}rck, Daniel},

year={2021},

publisher={Technische Universit{"a}t}

}

In this thesis, we investigate new ways to mitigate the inherent irregularity in sparse matrix factorizations and decompose the resulting computation into simple kernels which are portable across a diverse set of compute accelerator architectures through our novel compiler borG. Be it weather prediction, climate models, personalized medicine, genetic analysis and autonomous driving: some of today’s central challenges require processing of vast amounts of data, feeding large-scale simulations or AI models. As the scale of these problems outpaces the processing power and available storage capacity, it becomes crucial to exploit their inherent sparsity. Such sparse topologies, i.e., graph topologies where most of the nodes are not directly connected, are often the source for sparse linear systems of equations whose solution poses a major computational challenge. At the same time, we are witnessing a shift in terms of hardware in the high-performance computing field: as hardware designers try to avoid the quadratically increasing energy consumption for higher clock frequencies, compute setups increase parallelism and specialization instead. Notably, most of the accelerators in use today are optimized for massive parallelism on regular structures and dense data structures. Processing sparse workloads efficiently on novel, heterogeneous architectures presents a challenge that demands systemic solutions. In this thesis, we investigate strategies and systems focusing on an important building block for computational sciences: sparse numerical (matrix) factorizations. Factorizations exhibit irregularity in two aspects. First, the sparse data structures complicate workload distribution on accelerators geared towards regular grids. Second, numerically mandated pivoting introduces irregularity into the control flow. This leads to expensive synchronization points and requires expensive re-building of data structures. We propose two building blocks that help mitigate these problems for accelerators. First, a generalization of sparse factorizations to block-sparse matrices, leading to the use of batched, heavily templated compute kernels. Such kernels are relatively simple and can be tuned for the accelerator architecture in question. Second, we propose a data structure for block-sparse matrices that enables global pivoting through parallel index modifications. Additionally, we demonstrate how pivoting can be introduced into register-focused GPU kernels, leading to a two-level, threshold a-posteriori pivoting scheme. Both concepts are validated on implementations of sparse LDLt factorizations for GPUs. Once we extend the block-sparse approach to other architectures, we risk maintaining divergent, device-specific code bases for batched kernels. Therefore, we present the source-to-source compiler borG. Based on a novel intermediate representation unifying two distinct parallel architectures programming models, borG compiles OpenCL code that uses a generalization of the warp register cache idiom, to AVX512-based CPUs, NVIDIA GPUs and NEC’s SX-Aurora vector processor. The generated kernels may be specialized for a specific problem size, e.g., processing a batch of m*n matrices, and compare favorably to hand-coded kernels in the systems’ native development stack. Building on work so far, we extend the concept of block-sparse decomposition and generalize it into a meta-algorithm. Motivated by the rise of domain-specific, fixed function accelerators, we simplify the device-side code for factorizations even further. The resulting concept, METAPACK, can be implemented not just on more traditional compute accelerators, but also on "non-Neumann" types of hardware. The latter includes systolic arrays and FPGAs. Relying only on host-side pipelining and batched kernel submission, we lay out a blueprint of a versatile factorization generator. As a full-stack system, METAPACK will allow rapid creation of customized, parameterized, sparse factorization code across many systems that support batched or pipelined parallelism. Several experiments confirm the validity of the concept and encourage a full-fledged implementation. For the sake of simplicity, METAPACK regularizes sparse matrices by subdivision into a regular grid. In order to reduce the memory waste and exploit compute resources more efficiently, accelerators would need native support for work items of different sizes. By expanding on concepts from borG, we propose such support. We combine an incremental change of current accelerators’ architectures with a novel compiler pass to simplify such irregular scheduling on the hardware side. A prototypical software simulation shows this architectures’ potential to simplify support for irregular compute loads in the future. In summary, our work can help to improve and simplify the handling of sparse matrix factorizations and their efficiency on acclerators. Through a block-centric decomposition of the factorization process and a simplification of the primitive numerical kernels, we enable the use of novel and future accelerator architectures for the sparse problems of computational science.

May 9, 2021 by hgpu