high performance computing on graphics processing units: hgpu.org


Extending MAGMA Portability with OneAPI

Anna Fortenberry, Stanimire Tomov

Department of Computer Science and Engineering, University of North Texas, Denton, USA

Ninth Workshop on Accelerator Programming Using Directives (WACCPD 2022), Dallas, TX, 2022


Package: oneMAGMA-example


As the architectures of supercomputing systems continually change, maintaining efficient code portability is important for continuing to take advantage of the computing capabilities of the diverse and evolving hardware in these systems. Intel has adopted oneAPI, an open standard programming interface for heterogeneous systems designed to allow code portability across different processor architectures. This paper evaluates Intel’s oneAPI by migrating a general matrix-matrix multiplication (GEMM) CUDA algorithm from the dense linear algebra library Matrix Algebra on GPU and Multicore Architectures (MAGMA) to Data Parallel C++ (DPC++), the direct programming language of oneAPI. The DPC++ Compatibility Tool (DPCT) in Intel’s oneAPI was used successfully for an initial port of MAGMA to DPC++. The performance of the migrated code is compared to OpenMP GEMMs and state-of-the-art Intel MKL implementations on AMD EPYC 7742 multicore CPUs and Intel Xeon CPU E5-2698 v4 multicore CPUs, to the original native CUDA code in MAGMA on NVIDIA GeForce RTX 3060 discrete GPUs, and to oneMKL on Intel UHD Graphics P630 [0x3e96] integrated GPUs. The initial migrated code demonstrates impressive performance on multicore CPUs: it significantly outperforms the reference OpenMP implementations, and even MKL on AMD CPUs. Performance on NVIDIA GPUs is also surprising, as the DPC++ code matches the performance of the native CUDA code. The initial migrated code performed poorly on the Intel GPU, as expected, because the Intel GPU architecture used is quite different from the NVIDIA GPU architecture for which the original code was designed. However, using MAGMA’s parameterized implementations to tune the GEMM algorithm to better match the Intel GPU architecture improved performance significantly. Intel’s oneAPI allowed for a successful extension of MAGMA’s functional and performance portability to multicore CPUs and Intel GPUs.
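The abstract attributes the improvement on the Intel GPU to retuning a parameterized GEMM. As a rough illustration of that idea, and not MAGMA's actual kernel, the following plain-Python sketch computes C = alpha*A*B + beta*C with a blocked loop nest whose tile sizes are ordinary arguments. The function names and the `tile_m`, `tile_n`, `tile_k` parameters are hypothetical and chosen only for this example; the point is that changing the tile sizes changes how the work is partitioned without changing the result.

```python
# Illustrative sketch only: a blocked (tiled) GEMM, C = alpha*A*B + beta*C,
# with tunable tile sizes. All names and defaults here are hypothetical and
# are not MAGMA's kernel parameters.

def gemm_tiled(alpha, A, B, beta, C, tile_m=2, tile_n=2, tile_k=2):
    """Return alpha*A*B + beta*C for non-empty row-major lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "A must be m x k"
    assert len(C) == m and all(len(row) == n for row in C), "C must be m x n"
    # Start from the scaled input C so the tile loops only accumulate.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    # Each (i0, j0) output tile is independent of the others; on a GPU such a
    # tile would typically be assigned to one thread block or work-group.
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            # Walk the shared dimension in chunks of tile_k.
            for k0 in range(0, k, tile_k):
                for i in range(i0, min(i0 + tile_m, m)):
                    for j in range(j0, min(j0 + tile_n, n)):
                        acc = 0.0
                        for p in range(k0, min(k0 + tile_k, k)):
                            acc += A[i][p] * B[p][j]
                        out[i][j] += alpha * acc
    return out


def gemm_reference(alpha, A, B, beta, C):
    """Untiled triple loop, used only to check the tiled version."""
    m, k, n = len(A), len(B), len(B[0])
    return [
        [alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
         for j in range(n)]
        for i in range(m)
    ]
```

In a real tuning workflow the tile sizes would be swept per device and the fastest configuration kept; this sketch only shows that every configuration yields the same product.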

Tags: Computer science, CUDA, Heterogeneous systems, Linear Algebra, Matrix multiplication, nVidia, nVidia GeForce RTX 3060, oneAPI, Package, performance portability

December 25, 2022 by hgpu


