high performance computing on graphics processing units: hgpu.org

RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs

Benjamin Brock, Aydın Buluç, Katherine Yelick

EECS Department, University of California, Berkeley, CA

arXiv:2311.18141 [cs.DC], (29 Nov 2023)

DOI:10.48550/arXiv.2311.18141

Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement various asynchronous, RDMA-based sparse times dense (SpMM) and sparse times sparse (SpGEMM) algorithms, evaluating their performance running in a distributed memory setting on GPUs. Our RDMA-based implementations use the NVSHMEM communication library for direct, asynchronous one-sided communication between GPUs. We compare our asynchronous implementations to state-of-the-art bulk synchronous GPU libraries as well as a CUDA-aware MPI implementation of the SUMMA algorithm. We find that asynchronous RDMA-based implementations are able to offer favorable performance compared to bulk synchronous implementations, while also allowing for the straightforward implementation of novel work stealing algorithms.
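For readers unfamiliar with the SpMM kernel named in the abstract, the following is a minimal single-process sketch of a sparse times dense product with the sparse operand stored in CSR form. It is only an illustration of the local arithmetic: it is not the authors' code, it involves no GPU, NVSHMEM, or RDMA communication, and the function name `spmm_csr` and its parameters are hypothetical.

```python
# Illustrative only: a single-process CSR sparse-times-dense product (SpMM).
# This is NOT the paper's distributed, NVSHMEM-based implementation.

def spmm_csr(row_ptr, col_idx, values, dense):
    """Compute C = A @ B, where A is given in CSR form and B is a list of rows."""
    n_rows = len(row_ptr) - 1
    n_cols_out = len(dense[0])
    result = [[0.0] * n_cols_out for _ in range(n_rows)]
    for i in range(n_rows):
        # Nonzeros of row i of A live in positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]      # column index of the nonzero
            a_ij = values[k]    # value of the nonzero
            # Scale row j of B by a_ij and accumulate into row i of C.
            for c in range(n_cols_out):
                result[i][c] += a_ij * dense[j][c]
    return result


# A = [[1, 0, 2],
#      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
values = [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = spmm_csr(row_ptr, col_idx, values, B)
```

In a distributed setting, each process would typically own tiles of the operands and need tiles held by other processes before running a local product like this one. How those tiles are fetched and scheduled is the subject of the paper and is not modeled here.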

Tags: Algorithms, Computer science, CUDA, Matrix multiplication, MPI, nVidia, nVidia DGX-2, Sparse matrix, Tesla V100

December 3, 2023 by hgpu
