Optimising the DBCSR GPU Implementation

Jay Chetty
The University of Edinburgh
The University of Edinburgh, 2011








The DBCSR library performs the sparse matrix multiplication required for atomistic simulations with the CP2K software. This work targeted the GPU implementation of DBCSR for optimisation and extended its scope so that it can handle larger block sizes. The main kernel was sped up by 16% by modifying the algorithm so that multiple elements were assigned to each thread. Assigning each thread block its own local C matrix removed the need for locks on the C matrix; the cost of the required reduction step, however, outweighed the benefit of removing the locks. The cuBLAS dgemm function proved to be a suitable candidate for handling block sizes too large for the original method to process.
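The trade-off between a locked shared C matrix and per-thread-block local C matrices can be illustrated with a minimal CPU-side sketch. This is not DBCSR or CUDA code: each loop iteration merely stands in for one thread block, and the function names `multiply_shared_c` and `multiply_local_c` are hypothetical, chosen only for this illustration.

```python
# Illustrative CPU emulation only; names and structure are assumptions,
# not taken from the DBCSR source.

def zeros(rows, cols):
    """Return a rows x cols matrix of zeros as a list of rows."""
    return [[0.0] * cols for _ in range(rows)]


def multiply_shared_c(pairs):
    """Accumulate every A*B product directly into one shared C block.

    On a GPU, concurrent thread blocks updating the same C block would
    need a lock (or atomics) around the accumulation marked below.
    """
    rows, inner, cols = len(pairs[0][0]), len(pairs[0][1]), len(pairs[0][1][0])
    c = zeros(rows, cols)
    for a, b in pairs:  # one iteration stands in for one thread block
        for i in range(rows):
            for j in range(cols):
                for k in range(inner):
                    c[i][j] += a[i][k] * b[k][j]  # would be lock-protected
    return c


def multiply_local_c(pairs):
    """Give each 'thread block' a private C copy, then reduce the copies.

    Returns the reduced C and the number of additions spent purely on
    the reduction step, which is the extra cost of avoiding locks.
    """
    rows, inner, cols = len(pairs[0][0]), len(pairs[0][1]), len(pairs[0][1][0])
    local_cs = []
    for a, b in pairs:
        local = zeros(rows, cols)  # private to this 'thread block'
        for i in range(rows):
            for j in range(cols):
                for k in range(inner):
                    local[i][j] += a[i][k] * b[k][j]  # no lock needed
        local_cs.append(local)

    c = zeros(rows, cols)
    reduction_adds = 0
    for local in local_cs:  # separate reduction pass over all copies
        for i in range(rows):
            for j in range(cols):
                c[i][j] += local[i][j]
                reduction_adds += 1
    return c, reduction_adds
```

Both variants produce the same C block. The second avoids contended updates but needs one private copy of C per thread block plus a full extra pass to sum the copies, which is the kind of overhead the abstract reports as outweighing the benefit of removing the locks.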