high performance computing on graphics processing units: hgpu.org


Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures : Algorithms and Experiments

Mehmet Deveci, Simon D. Hammond, Michael M. Wolf, Sivasankaran Rajamanickam

Sandia National Laboratories, Albuquerque, NM

arXiv:1804.00695 [cs.DC], (2 Apr 2018)

@article{deveci2018sparse,
  title={Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures : Algorithms and Experiments},
  author={Deveci, Mehmet and Hammond, Simon D. and Wolf, Michael M. and Rajamanickam, Sivasankaran},
  year={2018},
  month={apr},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}


Architectures with multiple classes of memory media are becoming common in mainstream supercomputer deployments. In these so-called multi-level memories, each memory component has different characteristics, including bandwidth, latency, and capacity. This paper investigates the performance of sparse matrix multiplication kernels on two leading high-performance computing architectures: Intel’s Knights Landing (KNL) processor and NVIDIA’s Pascal GPU. We describe a data placement method and a chunking-based algorithm for our kernels that exploit the multiple memory spaces in each hardware platform. We evaluate these methods against standard algorithms that rely on the auto-caching mechanisms. Our results show that standard algorithms that exploit cache reuse performed as well as multi-memory-aware algorithms on architectures such as KNLs, where the memory subsystems have similar latencies. However, on architectures such as GPUs, where memory subsystems differ significantly in both bandwidth and latency, multi-memory-aware methods are crucial for good performance. In addition, our new approaches permit the user to run problems that require larger capacities than the fastest memory of each compute node without depending on the software-managed cache mechanisms.
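The abstract does not spell out the chunking scheme, so the following is only a rough sketch of the general idea, not the authors' algorithm. It assumes that A is stored in CSR form, that A is split into contiguous row blocks whose nonzero count fits a hypothetical fast-memory budget (max_nnz_per_chunk), and that each block is multiplied against B with a plain Gustavson-style row-wise product. All function and parameter names are illustrative, and no actual data movement between memory spaces is performed.

```python
from typing import Dict, List, Tuple

# CSR matrix as (row_ptr, col_idx, values); an illustrative layout, not the paper's.
CSR = Tuple[List[int], List[int], List[float]]


def chunk_rows(row_ptr: List[int], max_nnz: int) -> List[Tuple[int, int]]:
    """Split rows into contiguous [begin, end) blocks of at most max_nnz nonzeros.

    A single row with more than max_nnz nonzeros becomes its own block.
    """
    chunks = []
    start = 0
    n_rows = len(row_ptr) - 1
    for i in range(n_rows):
        # Close the current block before row i if adding row i would exceed the budget.
        if row_ptr[i + 1] - row_ptr[start] > max_nnz and i > start:
            chunks.append((start, i))
            start = i
    if start < n_rows:
        chunks.append((start, n_rows))
    return chunks


def spgemm_rows(a: CSR, b: CSR, begin: int, end: int) -> List[Dict[int, float]]:
    """Row-wise (Gustavson-style) product of rows [begin, end) of A with B."""
    a_ptr, a_col, a_val = a
    b_ptr, b_col, b_val = b
    result = []
    for i in range(begin, end):
        acc: Dict[int, float] = {}  # sparse accumulator for row i of C
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j = a_col[k]
            for t in range(b_ptr[j], b_ptr[j + 1]):
                c = b_col[t]
                acc[c] = acc.get(c, 0.0) + a_val[k] * b_val[t]
        result.append(acc)
    return result


def chunked_spgemm(a: CSR, b: CSR, max_nnz_per_chunk: int) -> CSR:
    """Compute C = A * B one row block of A at a time.

    On real hardware each block would be staged into the fast memory before
    the multiply; here the loop only marks where that staging would occur.
    """
    c_ptr, c_col, c_val = [0], [], []
    for begin, end in chunk_rows(a[0], max_nnz_per_chunk):
        # (assumed) stage rows [begin, end) of A into fast memory here
        for acc in spgemm_rows(a, b, begin, end):
            for col in sorted(acc):
                c_col.append(col)
                c_val.append(acc[col])
            c_ptr.append(len(c_col))
    return c_ptr, c_col, c_val
```

Because each row of C depends only on the matching row of A, the result is independent of the chunk size in this sketch; the budget only controls how much of A would need to be resident in the fast memory at once.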

Tags: Algorithms, Computer science, CUDA, Matrix multiplication, nVidia, Sparse matrix, Tesla P100

April 7, 2018 by hgpu
