high performance computing on graphics processing units: hgpu.org


Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

Endri Taka, Dimitrios Gourounas, Andreas Gerstlauer, Diana Marculescu, Aman Arora

The University of Texas at Austin, USA

arXiv:2404.11066 [cs.AR], (17 Apr 2024)

DOI:10.48550/arXiv.2404.11066


FPGAs are a promising platform for accelerating Deep Learning (DL) applications because of their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to support the computational demands of DL workloads more efficiently. However, the two most prominent AI-optimized FPGAs, AMD/Xilinx Versal ACAP and Intel Stratix 10 NX, take significantly different architectural approaches. This paper presents novel systematic frameworks that optimize the performance of General Matrix Multiplication (GEMM), a fundamental operation in DL workloads, by exploiting the distinct architectural characteristics of each FPGA. Our evaluation of int8 GEMM workloads shows throughput of up to 77 and 68 TOPs, with energy efficiency of up to 0.94 and 1.35 TOPs/W, for Versal VC1902 and Stratix 10 NX, respectively. This work provides insights and guidelines for optimizing GEMM-based applications on both platforms, and it also examines their programmability trade-offs and associated challenges.
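To make the quantities in the abstract concrete, the following is a minimal reference sketch of an int8 GEMM and of how a TOPs figure is conventionally derived from matrix dimensions and run time. It is not the authors' framework and does not model either FPGA. The function names (`gemm_int8`, `gemm_ops`, `tera_ops_per_second`) are hypothetical, and the convention of counting each multiply-accumulate as two operations is an assumption, since the abstract does not state how operations are counted.

```python
def gemm_int8(a, b):
    """Reference C = A x B for int8 inputs (hypothetical helper).

    Inputs are lists of rows holding values in [-128, 127]. Products are
    summed in Python integers, which stand in for the wider accumulator
    that int8 hardware needs to avoid overflow.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions must match")
    n = len(b[0])
    for row in a + b:
        for value in row:
            if not -128 <= value <= 127:
                raise ValueError("value outside int8 range")
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # wide accumulator for the dot product
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def gemm_ops(m, n, k):
    """Operation count for an M x K by K x N GEMM.

    Assumes one multiply plus one add per multiply-accumulate.
    """
    return 2 * m * n * k


def tera_ops_per_second(m, n, k, seconds):
    """Throughput in TOPs for a GEMM that took `seconds` to run."""
    return gemm_ops(m, n, k) / seconds / 1e12


if __name__ == "__main__":
    print(gemm_int8([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # A single dot product of four int8 maxima already exceeds 16 bits.
    print(gemm_int8([[127, 127, 127, 127]], [[127], [127], [127], [127]]))
    print(tera_ops_per_second(1000, 1000, 1000, 0.002))
```

Under this counting convention, a 1000 x 1000 x 1000 GEMM that finishes in 2 ms corresponds to 1 TOPs, which gives a sense of scale for the reported 77 and 68 TOPs peaks.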

Tags: AI, Computer science, Deep learning, FPGA, GEMM, Matrix multiplication

April 21, 2024 by hgpu
