high performance computing on graphics processing units: hgpu.org

Optimising the DBCSR GPU Implementation

Jay Chetty

The University of Edinburgh

The University of Edinburgh, 2011

The DBCSR library performs the sparse matrix multiplication required for atomistic simulations in the CP2K software. This work targeted the GPU implementation of DBCSR for optimisation and extended its scope so that it can handle larger block sizes. The main kernel was sped up by 16% by modifying the algorithm so that multiple elements are assigned to each thread. Assigning each thread block its own local C matrix removed the need for locks on the C matrix; however, the cost of the required reduction step outweighed the benefit of removing the locks. The CUBLAS dgemm function proved to be a suitable candidate for handling block sizes too large for the original method to process.
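
The trade-off described in the abstract can be illustrated with a small CPU-side model. This is only a sketch under stated assumptions, not the thesis's CUDA kernel or any DBCSR API: all function names below (`multiply_block`, `elements_for_thread`, `accumulate_shared`, `accumulate_private`) are hypothetical, the "thread blocks" are emulated with plain Python lists, and the strided element-to-thread mapping is one common choice rather than the one the thesis necessarily used.

```python
# Hypothetical CPU-side model of the ideas in the abstract; not DBCSR code.


def multiply_block(a, b, m, n, k):
    """Dense (m x k) times (k x n) product on row-major flat lists."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            total = 0.0
            for p in range(k):
                total += a[i * k + p] * b[p * n + j]
            c[i * n + j] = total
    return c


def elements_for_thread(thread_id, num_threads, num_elements):
    """Strided mapping (an assumption): thread t handles t, t + T, t + 2T, ..."""
    return list(range(thread_id, num_elements, num_threads))


def accumulate_shared(products, size):
    """All workers add into one shared C block.

    On a GPU each of these updates would have to be guarded by a lock
    (or an atomic), which is the cost the private-copy variant avoids.
    """
    c = [0.0] * size
    for prod in products:
        for idx, value in enumerate(prod):
            c[idx] += value  # the guarded update in a real kernel
    return c


def accumulate_private(products, size, num_blocks):
    """Each emulated thread block owns a private C; a reduction sums them."""
    local = [[0.0] * size for _ in range(num_blocks)]
    for task, prod in enumerate(products):
        owner = local[task % num_blocks]  # no other block touches this copy
        for idx, value in enumerate(prod):
            owner[idx] += value
    # Reduction step: num_blocks * size extra additions and extra memory traffic.
    c = [0.0] * size
    for part in local:
        for idx, value in enumerate(part):
            c[idx] += value
    return c


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]  # 2 x 2 block of A
    b = [5.0, 6.0, 7.0, 8.0]  # 2 x 2 block of B
    products = [multiply_block(a, b, 2, 2, 2) for _ in range(3)]
    print(accumulate_shared(products, 4))
    print(accumulate_private(products, 4, num_blocks=2))
    print(elements_for_thread(1, num_threads=4, num_elements=10))
```

Both accumulation variants produce the same C block; the private-copy variant trades the lock for one extra pass over `num_blocks` copies of C, which is the kind of reduction overhead the abstract reports as outweighing the benefit.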

Tags: Algorithms, Computer science, CUBLAS, CUDA, Matrix multiplication, nVidia, Sparse matrix, Tesla C2050, Thesis

December 31, 2011 by hgpu
