high performance computing on graphics processing units: hgpu.org

Irregularity Mitigation and Portability Abstractions for Accelerated Sparse Matrix Factorization

Daniel Thürck

Darmstadt, Technische Universität, 2021

DOI:10.26083/tuprints-00017951

In this thesis, we investigate new ways to mitigate the inherent irregularity in sparse matrix factorizations and decompose the resulting computation into simple kernels that are portable across a diverse set of compute accelerator architectures through our novel compiler borG. Be it weather prediction, climate models, personalized medicine, genetic analysis, or autonomous driving: some of today’s central challenges require processing vast amounts of data that feed large-scale simulations or AI models. As the scale of these problems outpaces the available processing power and storage capacity, it becomes crucial to exploit their inherent sparsity. Such sparse topologies, i.e., graph topologies in which most nodes are not directly connected, are often the source of sparse linear systems of equations whose solution poses a major computational challenge.
At the same time, we are witnessing a hardware shift in the high-performance computing field: as hardware designers try to avoid the quadratically increasing energy consumption of higher clock frequencies, compute setups increase parallelism and specialization instead. Notably, most of the accelerators in use today are optimized for massive parallelism on regular, dense data structures. Processing sparse workloads efficiently on novel, heterogeneous architectures presents a challenge that demands systemic solutions.
In this thesis, we investigate strategies and systems focusing on an important building block for the computational sciences: sparse numerical (matrix) factorizations. Factorizations exhibit irregularity in two respects. First, the sparse data structures complicate workload distribution on accelerators geared towards regular grids. Second, numerically mandated pivoting introduces irregularity into the control flow, which leads to expensive synchronization points and requires costly rebuilding of data structures.
We propose two building blocks that help mitigate these problems on accelerators. First, we generalize sparse factorizations to block-sparse matrices, which leads to the use of batched, heavily templated compute kernels. Such kernels are relatively simple and can be tuned for the accelerator architecture in question. Second, we propose a data structure for block-sparse matrices that enables global pivoting through parallel index modifications. Additionally, we demonstrate how pivoting can be introduced into register-focused GPU kernels, leading to a two-level, threshold a-posteriori pivoting scheme. Both concepts are validated on implementations of sparse LDLt factorizations for GPUs.
Once we extend the block-sparse approach to other architectures, we risk having to maintain divergent, device-specific code bases for batched kernels. Therefore, we present the source-to-source compiler borG. Based on a novel intermediate representation that unifies the programming models of two distinct parallel architectures, borG compiles OpenCL code that uses a generalization of the warp register cache idiom to AVX512-based CPUs, NVIDIA GPUs, and NEC’s SX-Aurora vector processor. The generated kernels may be specialized for a specific problem size, e.g., processing a batch of m × n matrices, and compare favorably to hand-coded kernels written in the systems’ native development stacks.
Building on this work, we extend the concept of block-sparse decomposition and generalize it into a meta-algorithm. Motivated by the rise of domain-specific, fixed-function accelerators, we simplify the device-side code for factorizations even further. The resulting concept, METAPACK, can be implemented not just on more traditional compute accelerators but also on "non-Neumann" types of hardware, including systolic arrays and FPGAs. Relying only on host-side pipelining and batched kernel submission, we lay out a blueprint for a versatile factorization generator.
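
To make the idea of batched kernels on small dense blocks more concrete, here is a minimal sketch in plain Python. It is not taken from the thesis: the function names `ldlt_block`, `batched_ldlt`, and `reconstruct` are hypothetical, the blocks are assumed to be small symmetric matrices stored as lists of lists, and no pivoting is performed, so a zero pivot would fail. The pivoting schemes described above address exactly that case and are not reproduced here.

```python
def ldlt_block(a):
    """Unpivoted LDLt factorization of one small dense symmetric block.

    Returns (l, d) with l unit lower triangular and d the diagonal of D,
    such that a == l * diag(d) * l^T. Assumes every pivot d[j] is nonzero.
    """
    n = len(a)
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # Diagonal entry: subtract contributions of already computed columns.
        d[j] = a[j][j] - sum(l[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] * d[k] for k in range(j))
            l[i][j] = (a[i][j] - s) / d[j]
    return l, d


def batched_ldlt(blocks):
    """Apply the same small kernel to every block of a batch.

    On an accelerator, each block would map to one work item or work group;
    here the batch is simply processed in a loop.
    """
    return [ldlt_block(b) for b in blocks]


def reconstruct(l, d):
    """Multiply l * diag(d) * l^T back together, for checking a result."""
    n = len(d)
    return [[sum(l[i][k] * d[k] * l[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    batch = [
        [[4.0, 2.0], [2.0, 3.0]],      # positive definite block
        [[2.0, -1.0], [-1.0, -2.0]],   # indefinite block, still factorizable
    ]
    for l, d in batched_ldlt(batch):
        print(l, d)
```

Because every block in the batch has the same size and runs the same instruction sequence, the per-block work is regular even though the surrounding matrix is sparse, which is the property the block-sparse formulation relies on.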
As a full-stack system, METAPACK will allow rapid creation of customized, parameterized sparse factorization code across many systems that support batched or pipelined parallelism. Several experiments confirm the validity of the concept and encourage a full-fledged implementation.
For the sake of simplicity, METAPACK regularizes sparse matrices by subdividing them into a regular grid. To reduce memory waste and exploit compute resources more efficiently, accelerators would need native support for work items of different sizes. Expanding on concepts from borG, we propose such support: we combine an incremental change to current accelerators’ architectures with a novel compiler pass to simplify such irregular scheduling on the hardware side. A prototypical software simulation shows this architecture’s potential to simplify support for irregular compute loads in the future.
In summary, our work can help to improve and simplify the handling of sparse matrix factorizations and their efficiency on accelerators. Through a block-centric decomposition of the factorization process and a simplification of the primitive numerical kernels, we enable the use of novel and future accelerator architectures for the sparse problems of computational science.
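
The memory waste caused by subdividing a sparse matrix into a regular grid can be illustrated with a small sketch. This is an assumption-laden illustration, not METAPACK code: the function name `block_grid_stats` is hypothetical, the matrix is given only as a list of (row, column) coordinates of its nonzeros, and every grid block that contains at least one nonzero is assumed to be stored as a full dense square block.

```python
def block_grid_stats(coords, block_size):
    """Count dense blocks and padding for a regular block grid.

    coords: iterable of (row, col) positions of nonzero entries.
    block_size: edge length of the square grid blocks.
    """
    nonzeros = set(coords)  # ignore duplicate coordinates
    # A block is identified by the grid cell its entries fall into.
    occupied = {(r // block_size, c // block_size) for r, c in nonzeros}
    stored = len(occupied) * block_size * block_size
    return {
        "blocks": len(occupied),          # dense blocks that must be stored
        "stored": stored,                 # entries stored, including zeros
        "nnz": len(nonzeros),             # actual nonzero entries
        "padding": stored - len(nonzeros),  # explicit zeros (memory waste)
    }


if __name__ == "__main__":
    # 4 x 4 matrix with a full diagonal and one off-diagonal entry.
    pattern = [(0, 0), (1, 1), (2, 2), (3, 3), (0, 3)]
    print(block_grid_stats(pattern, block_size=2))
```

In this toy pattern, a single off-diagonal entry forces an entire extra block to be stored. Larger blocks make the kernels more regular but increase such padding, which is the trade-off that support for work items of different sizes is meant to ease.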

Tags: Algorithms, Computer science, Factorization, Heterogeneous systems, nVidia, nVidia GeForce GTX Titan V, OpenCL, Sparse matrix, Thesis

May 9, 2021 by hgpu
