Evaluation of DGEMM Implementation on Intel Xeon Phi Coprocessor

Pawel Gepner, Victor Gamayunov, David L. Fraser, Eric Houdard, Ludovic Sauge, Damien Declat, Mathieu Dubois

Intel Corporation, Pipers Way, Swindon Wiltshire SN3 1RJ, United Kingdom

Journal of Computers, Vol. 9, No. 7, 2014

DOI:10.4304/jcp.9.7.1566-1571

In this paper we present a detailed study of implementing double-precision matrix-matrix multiplication (DGEMM) on the Intel Xeon Phi Coprocessor. We discuss a DGEMM implementation that runs "natively" on the coprocessor, minimizing communication with the host CPU. We run DGEMM across a range of matrix sizes natively as well as using the Intel Math Kernel Library. Our optimizations are designed to support maximal reuse of the on-die cache, which significantly reduces transfers from GDDR memory. Finally, we analyze the improvement of a classic matrix multiplication implementation based on the Cauchy algorithm compared to the latest results achieved using the Intel Math Kernel Library DGEMM subroutine.
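The cache-reuse idea mentioned in the abstract can be illustrated with a generic tiled (blocked) matrix product. This is only a minimal sketch in plain Python, not the authors' Xeon Phi kernel: the function names `matmul_naive` and `matmul_blocked` and the `tile` parameter are hypothetical, and a real implementation would pick a tile size that fits the target cache and would use vectorized native code.

```python
def matmul_naive(a, b):
    """Textbook triple-loop product C = A * B on lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, tile=2):
    """Tiled product: work on tile x tile sub-blocks so that each block
    of A, B and C is reused several times while it is still in cache.
    `tile` is an illustrative parameter, not a value from the paper."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # block row of C
        for p0 in range(0, k, tile):          # block along the shared dimension
            for j0 in range(0, m, tile):      # block column of C
                for i in range(i0, min(i0 + tile, n)):
                    row_c = c[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        row_b = b[p]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    print(matmul_blocked(a, b))
```

Both functions compute the same product; the blocked version only changes the order in which elements are visited, which is what allows data already loaded into cache to be reused before it is evicted.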

Tags: Algorithms, Computer science, Intel Xeon Phi, Linear Algebra, Matrix multiplication, Performance

June 28, 2014 by hgpu
