A Comparison of Potential Interfaces for Batched BLAS Computations
School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK
MIMS Eprint 2016.42, 2016
@article{relton2016comparison,
title={A Comparison of Potential Interfaces for Batched BLAS Computations},
author={Relton, Samuel D and Valero-Lara, Pedro and Zounon, Mawussi},
year={2016}
}
One trend in modern high-performance computing (HPC) is to decompose a large linear algebra problem into thousands of small problems that can be solved independently. There is a clear need for a batched BLAS standard, allowing users to perform thousands of small BLAS operations in parallel while making efficient use of their hardware. There are many possible ways in which the BLAS standard can be extended for batch operations. We discuss many of these possible designs, giving benefits and criticisms of each, along with a number of experiments designed to determine how the API may affect performance on modern HPC systems. Related issues that influence API design, such as the effect of memory layout on performance, are also discussed.
August 11, 2016 by hgpu