high performance computing on graphics processing units: hgpu.org

Matrix multiplication

IA-SpGEMM: An Input-aware Auto-tuning Framework for Parallel Sparse Matrix-Matrix Multiplication

Zhen Xie, Guangming Tan, Weifeng Liu, Ninghui Sun

Tags: Algorithms, Auto-Tuning, Computer science, CUDA, Deep learning, Matrix multiplication, nVidia, Sparse matrix, Tesla P100

June 23, 2019 by hgpu

Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs

Dominik Ernst, Georg Hager, Jonas Thies, Gerhard Wellein

Tags: Benchmarking, Code generation, Computer science, CUBLAS, CUDA, Matrix multiplication, nVidia, Package, Performance, Tesla V100

May 12, 2019 by hgpu

A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling

James D. Stevens

Tags: AMD Radeon R9 Fury, ATI, Code generation, Computer science, Finite difference, Heterogeneous systems, Matrix multiplication, nVidia, nVidia GeForce GTX Titan X, OpenCL, Performance, Tesla C2070, Tesla K40

April 28, 2019 by hgpu

Novel Data-Partitioning Algorithms for Performance and Energy Optimization of Data-Parallel Applications on Modern Heterogeneous HPC Platforms

Hamidreza Khaleghzadeh

Tags: Computer science, CUDA, FPGA, Heterogeneous systems, Matrix multiplication, nVidia, OpenCL, Package, Thesis

March 17, 2019 by hgpu

Supporting mixed-datatype matrix multiplication within the BLIS framework

Field G. Van Zee, Devangi N. Parikh, Robert A. van de Geijn

Tags: Computer science, Matrix multiplication, OpenMP, Package

January 27, 2019 by hgpu

Performance Evaluation and Tuning of An OpenCL based Matrix Multiplier

Yiyu Tan, Toshiyuki Imamura

Tags: Computer science, FPGA, Linear Algebra, Matrix multiplication, OpenCL

September 2, 2018 by hgpu

Implementing Strassen’s Algorithm with CUTLASS on NVIDIA Volta GPUs

Jianyu Huang, Chenhan D. Yu, Robert A. van de Geijn

Tags: Algorithms, Computer science, CUBLAS, CUDA, Matrix multiplication, nVidia, Package, Tesla V100

September 2, 2018 by hgpu

libhclooc: Software Library Facilitating Out-of-core Implementations of Accelerator Kernels on Hybrid Computing Platforms

Daniel Hanlon, Hamidreza Khaleghzadeh, Ravi Reddy Manumachu, Alexey Lastovetsky

Tags: Cloud, Computer science, CUDA, FPGA, Hybrid computing, Intel Xeon Phi, Matrix multiplication, nVidia, OpenCL, Package

August 19, 2018 by hgpu

Implementing general matrix-matrix multiplication algorithm on the Intel Xeon Phi Knights Landing Processor

Raehyun Kim

Tags: Algorithms, Computer science, Intel Xeon Phi, Linear Algebra, Matrix multiplication, Optimization

June 13, 2018 by hgpu

Learning to Optimize Tensor Programs

Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Tags: ARM, Computer science, CUDA, Deep learning, Matrix multiplication, nVidia, nVidia GeForce GTX Titan X

May 26, 2018 by hgpu

High-performance sparse matrix-matrix products on Intel KNL and multicore architectures

Yusuke Nagasaka, Satoshi Matsuoka, Ariful Azad, Aydin Buluc

Tags: Algorithms, Computer science, Intel Xeon Phi, Matrix multiplication, Sparse matrix

April 7, 2018 by hgpu

Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments

Mehmet Deveci, Simon D. Hammond, Michael M. Wolf, Sivasankaran Rajamanickam

Tags: Algorithms, Computer science, CUDA, Matrix multiplication, nVidia, Sparse matrix, Tesla P100

April 7, 2018 by hgpu
