high performance computing on graphics processing units: hgpu.org

WgPy: GPU-accelerated NumPy-like array library for web browsers

Masatoshi Hidaka, Tatsuya Harada

Tags: Computer science, CUDA, Matrix multiplication, nVidia, nVidia GeForce RTX 4070, Python

March 10, 2025 by hgpu

On the Partitioning of GPU Power among Multi-Instances

Tirth Vamja, Kaustabha Ray, Felix George, UmaMaheswari C Devi

Tags: Computer science, CUDA, Energy-efficient computing, Machine learning, Matrix multiplication, nVidia, nVidia A100, nVidia V100, Performance

February 3, 2025 by hgpu

Utilizing Tensor Cores in Futhark

Kristoffer August Kortbæk, Rune Ejnar Bang Lejbølle

Tags: Benchmarking, Computer science, CUDA, Heterogeneous systems, High-level Languages, Matrix multiplication, nVidia, nVidia A100, Package, PTX

December 24, 2024 by hgpu

Reproducible Study and Performance Analysis of GPU Programming Paradigms: OpenACC vs. CUDA in Key Linear Algebra Computations

Ezhilmathi Krishnasamy, Pascal Bouvry

Tags: BLAS, Computer science, CUDA, Differential equations, HPC, Linear Algebra, Matrix multiplication, nVidia, nVidia A100, OpenACC, Package, Partial differential equations, PDEs, Performance

December 24, 2024 by hgpu

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

Xiaoteng (Frank) Liu, Pavly Halim

Tags: Computer science, CUDA, Energy-efficient computing, GEMM, Machine learning, Matrix multiplication, nVidia, nVidia GeForce RTX 4070, Package, Performance

December 1, 2024 by hgpu

Improving Parallel Program Performance Through DSL-Driven Code Generation with LLM Optimizers

Anjiang Wei, Allen Nie, Thiago S. F. X. Teixeira, Rohan Yadav, Wonchan Lee, Ke Wang, Alex Aiken

Tags: Code generation, Computer science, DSL, LLM, Matrix multiplication, nVidia, Optimization, Tesla P100

November 17, 2024 by hgpu

Mixed-precision finite element kernels and assembly: Rounding error analysis and hardware acceleration

Matteo Croci, Garth N. Wells

Tags: AVX, Computer science, Finite element method, Floating point error, Intel, Matrix multiplication, Mixed precision, Package

October 27, 2024 by hgpu

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

Tags: Computer science, CUDA, Data recovery, Extended precision, LLM, Matrix multiplication, nVidia, nVidia GeForce RTX 3090, Security

October 6, 2024 by hgpu

LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs

Junqing Lin, Jingwei Sun, Xiaolong Shi, Honghe Zhang, Xianzhi Yu, Xinzhi Wang, Jun Yao, Guangzhong Sun

Tags: Compilers, Computer science, CUDA, Deep learning, Linear Algebra, Matrix multiplication, Neural networks, nVidia, nVidia GeForce RTX 2080 Ti, Performance, Sparse matrix, Tesla V100

August 4, 2024 by hgpu

How much can we gain from Tensor Kernel Fusion on GPUs?

Wei Sun, Ang Li, Sander Stuijk, Henk Corporaal

Tags: Computer science, CUDA, Deep learning, Matrix multiplication, Neural networks, nVidia, nVidia A100, nVidia H100

June 16, 2024 by hgpu

Fast and Practical Strassen’s Matrix Multiplication using FPGAs

Afzal Ahmad, Linfeng Du, Wei Zhang

Tags: BLAS, Computer science, FPGA, GEMM, Linear Algebra, Machine learning, Matrix multiplication, OpenCL, Package

June 9, 2024 by hgpu

Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

L.A. Torres, Carlos J. Barrios H, Yves Denneulin

Tags: Computer science, CUBLAS, CUDA, Linear Algebra, Matrix multiplication, Neural networks, nVidia, nVidia A100, Package, Performance, SYCL

June 2, 2024 by hgpu
