high performance computing on graphics processing units: hgpu.org

hgpu.org » GEMM

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

Xiaoteng (Frank) Liu, Pavly Halim

Tags: Computer science, CUDA, Energy-efficient computing, GEMM, Machine learning, Matrix multiplication, nVidia, nVidia GeForce RTX 4070, Package, Performance

December 1, 2024 by hgpu

Fast and Practical Strassen’s Matrix Multiplication using FPGAs

Afzal Ahmad, Linfeng Du, Wei Zhang

Tags: BLAS, Computer science, FPGA, GEMM, Linear Algebra, Machine learning, Matrix multiplication, OpenCL, Package

June 9, 2024 by hgpu

Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

Endri Taka, Dimitrios Gourounas, Andreas Gerstlauer, Diana Marculescu, Aman Arora

Tags: AI, Computer science, Deep learning, FPGA, GEMM, Matrix multiplication

April 21, 2024 by hgpu

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen

Tags: Code generation, Computer science, CUDA, GEMM, Linear Algebra, Matrix multiplication, nVidia, nVidia A100, Package, Performance, Reliability, Tesla T4

May 7, 2023 by hgpu

Harmonic CUDA: Asynchronous Programming on GPUs

Jonathan Wapman, Sean Treichler, Serban D. Porumbescu, John D. Owens

Tags: Computer science, CUDA, GEMM, Linear Algebra, Matrix multiplication, nVidia, nVidia A100

March 5, 2023 by hgpu

ISM2: Optimizing Irregular-Shaped Matrix-Matrix Multiplication on GPUs

Cody Rivera, Jieyang Chen, Nan Xiong, Shuaiwen Leon Song, Dingwen Tao

Tags: Algorithms, Computer science, CUDA, GEMM, Linear Algebra, Matrix multiplication, nVidia, Tesla K40, Tesla M40, Tesla P100

February 16, 2020 by hgpu

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

Chetan Jhurani, Paul Mullowney

Tags: BLAS, CUBLAS, CUDA, Dense linear algebra, GEMM, Linear Algebra, nVidia, Parallel programming, Tesla K20

April 9, 2013 by chetan.jhurani
