Optimising the DBCSR GPU Implementation
The University of Edinburgh
The University of Edinburgh, 2011
@article{chetty2011optimising,
title={Optimising the DBCSR GPU Implementation},
author={Chetty, J.},
year={2011}
}
The DBCSR library performs the sparse matrix multiplication required for atomistic simulations in the CP2K software. This work targeted the GPU implementation of DBCSR for optimisation and extended its scope so that it can handle larger block sizes. The main kernel was sped up by 16% by modifying the algorithm so that each thread is assigned multiple elements. Assigning each thread block its own local C matrix removed the need for locks on the C matrix; however, the cost of the required reduction step outweighed the benefit of removing the locks. The cuBLAS dgemm function proved to be a suitable candidate for handling block sizes too large for the original method to process.
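The trade-off between locking a shared C matrix and giving each thread block its own local copy can be illustrated with a small serial model. The sketch below is not the thesis code or the DBCSR API: it is plain C++ on the CPU, and the names `Block`, `Product`, `accumulate_shared`, `accumulate_privatised` and the `workers` parameter (standing in for GPU thread blocks) are all hypothetical. It assumes square, dense, row-major blocks that all contribute to the same C block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A dense n x n block stored row-major (hypothetical type for illustration).
using Block = std::vector<double>;

// One small block product A * B that contributes to the same C block.
struct Product {
    Block a;
    Block b;
};

// Accumulate c += a * b for n x n row-major blocks.
void multiply_accumulate(const Block& a, const Block& b, Block& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Shared variant: every product writes straight into one C block.
// Run concurrently on a GPU, these writes would need a lock or atomics.
Block accumulate_shared(const std::vector<Product>& products, std::size_t n) {
    Block c(n * n, 0.0);
    for (const Product& p : products) {
        multiply_accumulate(p.a, p.b, c, n);
    }
    return c;
}

// Privatised variant: products are dealt round-robin to `workers`
// (a stand-in for thread blocks), each with its own local C, so no two
// workers ever write to the same memory and no lock is needed.
Block accumulate_privatised(const std::vector<Product>& products,
                            std::size_t n, std::size_t workers) {
    assert(workers > 0);
    std::vector<Block> local(workers, Block(n * n, 0.0));
    for (std::size_t p = 0; p < products.size(); ++p) {
        multiply_accumulate(products[p].a, products[p].b, local[p % workers], n);
    }
    // Reduction step: sum the local copies into the final C block.
    // This extra pass touches workers * n * n elements.
    Block c(n * n, 0.0);
    for (const Block& l : local) {
        for (std::size_t e = 0; e < n * n; ++e) {
            c[e] += l[e];
        }
    }
    return c;
}
```

Both variants produce the same C block; the privatised one trades the lock for extra memory and a reduction pass whose size grows with the number of local copies, which is consistent with the reduction cost described above outweighing the gain from removing the locks.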