## Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

Division of Scientific Computing, Department of Information Technology, Uppsala University, Box 337, SE-751 05 Uppsala, Sweden

arXiv:1501.07800 [cs.DC], (30 Jan 2015)

```bibtex
@article{rubensson2015localityaware,
  title={Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model},
  author={Rubensson, Emanuel H. and Rudberg, Elias},
  year={2015},
  month={jan},
  eprint={1501.07800},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}
```

We present a library for parallel block-sparse matrix-matrix multiplication on distributed memory clusters. The library is based on the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. Thanks to this model, we, as matrix library developers, do not have to deal explicitly with the distribution of work and data or with communication between computational nodes in a distributed memory cluster. The mapping of work and data to physical resources is instead handled by a Chunks and Tasks library. We evaluate our matrix library using the Chunks and Tasks library CHT-MPI (see www.chunks-and-tasks.org), which dynamically distributes data and work between nodes. This allows a priori unknown and irregular sparsity patterns to be exploited to improve performance. On each node, available Graphics Processing Units (GPUs) are used for matrix-matrix multiplication tasks. When all GPUs are busy, tasks also run on the available CPUs, thus making use of the full computing capacity of each node. Matrices are represented by sparse quadtrees of chunk objects, whose leaves are block-sparse submatrices. Sparsity may occur at any level in the hierarchy and/or within the submatrix leaves. Computation is overlapped with communication both for data transfers between nodes in the cluster and for data transfers to/from GPUs. We evaluate the performance of the matrix-matrix multiplication implementations for dense and banded matrices, as well as for matrices with randomly distributed submatrix blocks. We demonstrate the performance of the symmetric matrix square operation, which takes advantage of the multiplicand and product matrices being symmetric. We also present symmetric matrix square calculations for matrices from linear scaling electronic structure calculations, for which the sparse symmetric matrix square is a major computational challenge.
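To make the quadtree representation concrete, the following is a minimal, serial Python sketch of the core idea described in the abstract: a matrix stored as a sparse quadtree whose zero subtrees are omitted entirely, so that the recursive multiplication skips work at every level of the hierarchy. It is an illustration only, not the library's actual API; all names (`to_quadtree`, `qt_mul`, etc.) are invented for this sketch, the leaf blocks are plain dense NumPy arrays rather than block-sparse submatrices, and matrix dimensions are assumed to be powers of two.

```python
import numpy as np

# A quadtree node is one of:
#   None               - an all-zero submatrix (stored implicitly)
#   np.ndarray         - a dense leaf block (the real library uses
#                        block-sparse leaves here)
#   (nw, ne, sw, se)   - four child quadrants

def to_quadtree(m, leaf=2):
    """Build a sparse quadtree from a square matrix; zero subtrees become None."""
    if not m.any():
        return None                      # sparsity at any level of the hierarchy
    n = m.shape[0]
    if n <= leaf:
        return m.copy()
    h = n // 2
    return (to_quadtree(m[:h, :h], leaf), to_quadtree(m[:h, h:], leaf),
            to_quadtree(m[h:, :h], leaf), to_quadtree(m[h:, h:], leaf))

def qt_add(a, b):
    """Add two quadtrees of the same shape; None acts as a zero submatrix."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, np.ndarray):
        return a + b
    return tuple(qt_add(x, y) for x, y in zip(a, b))

def qt_mul(a, b):
    """Recursive block multiplication; zero subtrees are skipped entirely."""
    if a is None or b is None:
        return None                      # no work spawned for zero operands
    if isinstance(a, np.ndarray):
        return a @ b                     # leaf-level multiply (GPU task in the library)
    a00, a01, a10, a11 = a
    b00, b01, b10, b11 = b
    return (qt_add(qt_mul(a00, b00), qt_mul(a01, b10)),
            qt_add(qt_mul(a00, b01), qt_mul(a01, b11)),
            qt_add(qt_mul(a10, b00), qt_mul(a11, b10)),
            qt_add(qt_mul(a10, b01), qt_mul(a11, b11)))

def to_dense(t, n):
    """Expand a quadtree back to a dense n x n matrix, for checking results."""
    if t is None:
        return np.zeros((n, n))
    if isinstance(t, np.ndarray):
        return t
    h = n // 2
    out = np.zeros((n, n))
    out[:h, :h] = to_dense(t[0], h)
    out[:h, h:] = to_dense(t[1], h)
    out[h:, :h] = to_dense(t[2], h)
    out[h:, h:] = to_dense(t[3], h)
    return out
```

In the real library, each of the eight recursive `qt_mul` calls would be registered as an independent task, and the Chunks and Tasks runtime, not the matrix code, decides on which node each chunk lives and where each task runs; the sketch above only shows the sparsity-driven recursion structure.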

February 2, 2015 by hgpu