Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

hgpu.org » Programming » Algorithms » Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

Emanuel H. Rubensson, Elias Rudberg

Division of Scientific Computing, Department of Information Technology, Uppsala University, Box 337, SE-751 05 Uppsala, Sweden

arXiv:1501.07800 [cs.DC], (30 Jan 2015)

@article{rubensson2015localityaware,

title={Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model},

author={Rubensson, Emanuel H. and Rudberg, Elias},

year={2015},

month={jan},

archivePrefix={"arXiv"},

primaryClass={cs.DC}

}

Download (PDF)

View

Source

1405

views

We present a library for parallel block-sparse matrix-matrix multiplication on distributed memory clusters. The library is based on the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. Acting as matrix library developers, using this model we do not have to explicitly deal with distribution of work and data or communication between computational nodes in a distributed memory cluster. The mapping of work and data to physical resources is instead handled by a Chunks and Tasks library. We evaluate our matrix library using the Chunks and Tasks library CHT-MPI, see www.chunks-and-tasks.org, which dynamically distributes data and work between nodes. This allows for exploitation of a priori unknown and irregular sparsity patterns to improve performance. On each node, available Graphics Processing Units (GPUs) are used for matrix-matrix multiplication tasks. When all GPUs are busy, tasks are also running on the available CPUs, thus making use of the full computing capacity of each node. Matrices are represented by sparse quadtrees of chunk objects. The leaves in the hierarchy are block-sparse submatrices. Sparsity may occur at any level in the hierarchy and/or within the submatrix leaves. Overlap between computation and communication is achieved for both the data transfers between nodes in the cluster and for the data transfers to/from GPUs. We evaluate the performance of the matrix-matrix multiplication implementations for dense and banded matrices and matrices with randomly distributed submatrix blocks. We demonstrate the performance of the symmetric matrix square operation, taking advantage of multiplicand and product matrices being symmetric. We also present symmetric matrix square calculations for matrices from linear scaling electronic structure calculations. The sparse symmetric matrix square is a major computational challenge for such calculations.

Tags: Algorithms, Computer science, CUDA, Matrix multiplication, MPI, nVidia, Sparse matrix, Tesla K20

February 2, 2015 by hgpu

No votes yet.

Please wait...

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

* * *

high performance computing on graphics processing units: hgpu.org