high performance computing on graphics processing units: hgpu.org


Optimising the DBCSR GPU Implementation

Jay Chetty

The University of Edinburgh

The University of Edinburgh, 2011

@article{chetty2011optimising,
  title={Optimising the DBCSR GPU Implementation},
  author={Chetty, J.},
  year={2011}
}


The DBCSR library performs the sparse matrix multiplication required for atomistic simulations in the CP2K software. This work set out to optimise the GPU implementation of DBCSR and to extend its scope so that it can handle larger block sizes. The main kernel was sped up by 16% by modifying the algorithm so that multiple elements are assigned to each thread. Assigning each thread block its own local C matrix removed the need for locks on the C matrix; however, the cost of the required reduction step outweighed the benefit of removing the locks. The cuBLAS dgemm function proved a suitable candidate for handling block sizes too large for the original method to process.
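The trade-off between locking a shared C matrix and giving each worker a private copy can be illustrated with a small CPU-side sketch. This is not the thesis's CUDA kernel: the function names, the dictionary-of-blocks layout, and the round-robin assignment of block products to workers are all assumptions made for illustration. Each "worker" stands in for a thread block that accumulates into its own local C, and the final loop is the reduction step whose cost the abstract refers to.

```python
def block_matmul(a, b):
    """Dense product of two small blocks stored as lists of rows."""
    inner = len(b)
    return [[sum(a[i][p] * b[p][j] for p in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def accumulate(c_blocks, key, block):
    """Add `block` into c_blocks[key], creating a zero block on first use."""
    if key not in c_blocks:
        c_blocks[key] = [[0.0] * len(block[0]) for _ in block]
    target = c_blocks[key]
    for i, row in enumerate(block):
        for j, value in enumerate(row):
            target[i][j] += value


def multiply_with_private_c(a_blocks, b_blocks, n_workers):
    """Block-sparse C = A * B using one private C per worker (hypothetical layout).

    a_blocks maps (block_row, block_col) -> dense block, likewise b_blocks.
    """
    # List every block product A(i,k) * B(k,j) that contributes to C(i,j).
    work = [(i, k, j)
            for (i, k) in sorted(a_blocks)
            for (k2, j) in sorted(b_blocks)
            if k == k2]

    # Each worker writes only to its own C, so no locks are needed here.
    private = [{} for _ in range(n_workers)]
    for index, (i, k, j) in enumerate(work):
        product = block_matmul(a_blocks[(i, k)], b_blocks[(k, j)])
        accumulate(private[index % n_workers], (i, j), product)

    # Reduction step: sum the private copies into the final C.
    c_blocks = {}
    for local_c in private:
        for key, block in local_c.items():
            accumulate(c_blocks, key, block)
    return c_blocks
```

In this sketch the reduction touches every block that any worker produced, so its cost grows with the number of workers; that is the kind of overhead that, according to the abstract, outweighed the saving from removing the locks.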

Tags: Algorithms, Computer science, CUBLAS, CUDA, Matrix multiplication, nVidia, Sparse matrix, Tesla C2050, Thesis

December 31, 2011 by hgpu

