high performance computing on graphics processing units: hgpu.org

Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU

Mohammad Zubair, Christoph Bauinger

Old Dominion University, Norfolk, Virginia, USA

arXiv:2311.00368 [cs.LG], (1 Nov 2023)

DOI:10.48550/arXiv.2311.00368

In this paper, we focus on three sparse matrix operations that are relevant for machine learning applications: sparse-dense matrix multiplication (SPMM), sampled dense-dense matrix multiplication (SDDMM), and the composition of SDDMM with SPMM, also termed FusedMM. We develop optimized implementations of the SPMM, SDDMM, and FusedMM operations using Intel oneAPI’s Explicit SIMD (ESIMD) SYCL extension API. In contrast to CUDA or SYCL, the ESIMD API enables the writing of explicitly vectorized kernel code. Sparse matrix algorithms implemented with the ESIMD API achieve performance close to the peak of the targeted Intel Data Center GPU. We compare our performance results to Intel’s oneMKL library on Intel GPUs and to a recent CUDA implementation of the sparse matrix operations on NVIDIA’s V100 GPU, and demonstrate that our implementations outperform both.
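For readers unfamiliar with the three operations named in the abstract, the following is a minimal reference sketch in plain Python. It is not the authors' ESIMD implementation and makes no attempt at performance. It assumes the sparse matrix S is stored in CSR form (`indptr`, `indices`, `data`), that SDDMM scales each stored entry S[i][j] by the dot product of row i of A and row j of B, and that FusedMM simply feeds the SDDMM result into SPMM. The function names, the CSR layout, and the row-wise convention for B are illustrative assumptions; definitions in the literature vary in these details.

```python
# Reference (unoptimized) semantics of SPMM, SDDMM, and FusedMM.
# Assumption: S is in CSR format; dense matrices are lists of rows.

def sddmm(indptr, indices, data, A, B):
    """For each stored entry (i, j) of S, return S[i][j] * dot(A[i], B[j]).

    The result reuses the sparsity pattern of S, so only the new values
    are returned (aligned with `indices`).
    """
    out = [0.0] * len(data)
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            out[k] = data[k] * sum(a * b for a, b in zip(A[i], B[j]))
    return out


def spmm(indptr, indices, data, D):
    """Return the dense product C = S @ D for a CSR matrix S."""
    n_cols = len(D[0])
    C = [[0.0] * n_cols for _ in range(len(indptr) - 1)]
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            for c in range(n_cols):
                C[i][c] += data[k] * D[j][c]
    return C


def fusedmm(indptr, indices, data, A, B, D):
    """Compose the two: SDDMM produces new values, SPMM consumes them."""
    return spmm(indptr, indices, sddmm(indptr, indices, data, A, B), D)


# Hypothetical toy input: S = [[1, 0, 2], [0, 3, 0]]
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
A = [[1, 2], [3, 4]]          # 2 x 2, one row per row of S
B = [[1, 0], [0, 1], [1, 1]]  # 3 x 2, one row per column of S
D = [[1, 0], [0, 1], [1, 1]]  # 3 x 2 dense right-hand side

sampled = sddmm(indptr, indices, data, A, B)
product = spmm(indptr, indices, data, D)
fused = fusedmm(indptr, indices, data, A, B, D)
```

This version materializes the intermediate SDDMM values before calling SPMM. A fused kernel would typically avoid that intermediate, which is the general motivation for treating FusedMM as an operation in its own right.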

Tags: Computer science, CUDA, Deep learning, Intel, Intel Data Center GPU Max 1550, Linear Algebra, Machine learning, Mathematical Software, Matrix multiplication, nVidia, nVidia V100, Sparse matrix, SYCL

November 5, 2023 by hgpu
