high performance computing on graphics processing units: hgpu.org

Matrix multiplication

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen

Tags: Code generation, Computer science, CUDA, GEMM, Linear Algebra, Matrix multiplication, nVidia, nVidia A100, Package, Performance, Reliability, Tesla T4

May 7, 2023 by hgpu

PopSparse: Accelerated block sparse matrix multiplication on IPU

Zhiyi Li, Douglas Orr, Valeriu Ohan, Godfrey Da costa, Tom Murray, Adam Sanders, Deniz Beker, Dominic Masters

Tags: Computer science, CUDA, Deep learning, Linear Algebra, Machine learning, Matrix multiplication, nVidia, nVidia A100, Package, Sparse matrix

April 2, 2023 by hgpu

Harmonic CUDA: Asynchronous Programming on GPUs

Jonathan Wapman, Sean Treichler, Serban D. Porumbescu, John D. Owens

Tags: Computer science, CUDA, GEMM, Linear Algebra, Matrix multiplication, nVidia, nVidia A100

March 5, 2023 by hgpu

Extending MAGMA Portability with OneAPI

Anna Fortenberry, Stanimire Tomov

Tags: Computer science, CUDA, Heterogeneous systems, Linear Algebra, Matrix multiplication, nVidia, nVidia GeForce RTX 3060, oneAPI, Package, performance portability

December 25, 2022 by hgpu

GPU Load Balancing

Muhammad Osama

Tags: Algorithms, Computer science, CUDA, Linear Algebra, load balancing, Matrix multiplication, nVidia, nVidia A100, Package, Sparse, Thesis

December 25, 2022 by hgpu

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU

Genghan Zhang, Yuetong Zhao, Yanting Tao, Zhongming Yu, Guohao Dai, Sitao Huang, Yuan Wen, Pavlos Petoumenos, Yu Wang

Tags: Computer science, CUDA, Matrix multiplication, nVidia, nVidia GeForce RTX 2080, nVidia GeForce RTX 3090, Package, Sparse matrix, Tesla V100

September 11, 2022 by hgpu

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

Tags: Computer science, CUDA, Deep learning, Machine learning, Matrix multiplication, nVidia, nVidia A40, Package, PyTorch

August 21, 2022 by hgpu

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numerical Behaviors

Wei Sun, Ang Li, Tong Geng, Sander Stuijk, Henk Corporaal

Tags: Benchmarking, Computer science, Matrix multiplication, nVidia, nVidia A100, nVidia GeForce RTX 2080, nVidia GeForce RTX 2080 Ti, Performance, PTX, Sparse matrix

June 12, 2022 by hgpu

Fast Arbitrary Precision Floating Point on FPGA

Johannes de Fine Licht, Christopher A. Pattison, Alexandros Nikolaos Ziogas, David Simmons-Duffin, Torsten Hoefler

Tags: Computer science, DSP, FPGA, Matrix multiplication, Package

April 17, 2022 by hgpu

Extending SYCL’s Programming Paradigm with Tensor-based SIMD Abstractions

Wilson Feng, Shucai Yao, Kai Ting Wang, Md Aamir Raihan, Laichun Feng, Chunrong Xu

Tags: Benchmarking, Computer science, Heterogeneous systems, Matrix multiplication, SYCL

April 10, 2022 by hgpu

Advanced Joins on GPUs

Christos Bellas

Tags: Algorithms, Computer science, CUDA, Matrix multiplication, nVidia, nVidia GeForce GTX Titan XP, Package, Thesis

March 27, 2022 by hgpu

Heuristic Adaptability to Input Dynamics for SpMM on GPUs

Guohao Dai, Guyue Huang, Shang Yang, Zhongming Yu, Hengrui Zhang, Yufei Ding, Yuan Xie, Huazhong Yang, Yu Wang

Tags: Algorithms, CUDA, Matrix multiplication, Neural networks, nVidia, nVidia GeForce RTX 2080, nVidia GeForce RTX 3090, Package, Sparse matrix, Tesla V100

February 20, 2022 by hgpu


