Extending MAGMA Portability with OneAPI

Anna Fortenberry, Stanimire Tomov
Department of Computer Science and Engineering, University of North Texas, Denton, USA
Ninth Workshop on Accelerator Programming Using Directives (WACCPD 2022), Dallas, TX, 2022

@article{fortenberry2022extending,
   title={Extending MAGMA Portability with OneAPI},
   author={Fortenberry, Anna and Tomov, Stanimire},
   year={2022}
}


As the architectures of supercomputing systems are continually changing, it is important to maintain efficient code portability in order to continue to take advantage of the computing capabilities of the diverse and evolving hardware in these systems. Intel has adopted an open standard programming interface for heterogeneous systems called oneAPI, designed to allow code portability across different processor architectures. This paper evaluates Intel’s oneAPI by migrating a general matrix-matrix multiplication (GEMM) CUDA algorithm from the dense linear algebra library Matrix Algebra on GPU and Multicore Architectures (MAGMA) to Data Parallel C++ (DPC++), the direct programming language of oneAPI. The DPC++ Compatibility Tool (DPCT) in Intel’s oneAPI was used successfully for an initial port of MAGMA to DPC++. The performance of the migrated code is evaluated and compared to OpenMP GEMMs and state-of-the-art Intel MKL implementations on AMD EPYC 7742 multicore CPUs and Intel Xeon CPU E5-2698 V4 multicore CPUs, to the original native-CUDA code in MAGMA on NVIDIA GeForce RTX 3060 discrete GPUs, and to oneMKL on Intel UHD Graphics P630 [0x3e96] integrated GPUs. The initial migrated code demonstrates impressive performance on multicore CPUs, as it significantly outperforms reference OpenMP implementations, and even MKL on AMD CPUs. Performance on NVIDIA GPUs is also very surprising, as the DPC++ code matches the performance of the native CUDA code. The initial migrated code performed poorly on the Intel GPU, as expected, because the Intel GPU architecture used is quite different from the NVIDIA GPU architecture for which the original code was designed. However, using MAGMA’s parameterized implementations to tune the GEMM algorithm to better match the Intel GPU architecture improved performance significantly. Intel’s oneAPI allowed for a successful extension of MAGMA’s functional and performance portability to multicore CPUs and Intel GPUs.
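The tuning step described in the abstract depends on the GEMM being parameterized, so that blocking sizes can be changed without rewriting the algorithm. The following is a minimal CPU-only sketch of that general idea in plain C++. It is not MAGMA's kernel and not DPC++ code; the function name `tiled_gemm`, the template parameters `BLK_M`, `BLK_N`, `BLK_K`, and the row-major layout are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical illustration (not MAGMA code): computes
//   C = alpha * A * B + beta * C
// for row-major A (M x K), B (K x N), C (M x N).
// The tile sizes are compile-time parameters, so the same source can be
// instantiated with different blockings for different hardware.
template <std::size_t BLK_M, std::size_t BLK_N, std::size_t BLK_K>
void tiled_gemm(std::size_t M, std::size_t N, std::size_t K,
                double alpha, const std::vector<double>& A,
                const std::vector<double>& B,
                double beta, std::vector<double>& C) {
    // Scale C by beta once, before accumulating the product.
    for (std::size_t i = 0; i < M * N; ++i) {
        C[i] *= beta;
    }

    // Walk over tiles of C (i0, j0) and over the shared dimension (k0).
    for (std::size_t i0 = 0; i0 < M; i0 += BLK_M) {
        const std::size_t i1 = std::min(i0 + BLK_M, M);
        for (std::size_t j0 = 0; j0 < N; j0 += BLK_N) {
            const std::size_t j1 = std::min(j0 + BLK_N, N);
            for (std::size_t k0 = 0; k0 < K; k0 += BLK_K) {
                const std::size_t k1 = std::min(k0 + BLK_K, K);
                // Multiply one tile of A by one tile of B.
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const double a = alpha * A[i * K + k];
                        for (std::size_t j = j0; j < j1; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

Instantiating `tiled_gemm<64, 64, 16>` versus `tiled_gemm<16, 32, 8>` yields the same result with a different blocking, which is the kind of knob a tuning sweep would vary; the specific sizes here are arbitrary examples, not values from the paper.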