high performance computing on graphics processing units: hgpu.org

Matrix multiplication

Heuristic Adaptability to Input Dynamics for SpMM on GPUs

Guohao Dai, Guyue Huang, Shang Yang, Zhongming Yu, Hengrui Zhang, Yufei Ding, Yuan Xie, Huazhong Yang, Yu Wang

Tags: Algorithms, CUDA, Matrix multiplication, Neural networks, nVidia, nVidia GeForce RTX 2080, nVidia GeForce RTX 3090, Package, Sparse matrix, Tesla V100

February 20, 2022 by hgpu

FSpGEMM: An OpenCL-based HPC Framework for Accelerating General Sparse Matrix-Matrix Multiplication on FPGAs

Erfan Bank Tavakoli, Michael Riera, Masudul Hassan Quraishi, Fengbo Ren

Tags: Algorithms, Computer science, FPGA, HPC, Linear Algebra, Matrix multiplication, nVidia, nVidia GTX Titan X, OpenCL, Sparse matrix

December 26, 2021 by hgpu

TCUDB: Accelerating Database with Tensor Processors

Yu-Ching Hu, Yuliang Li, Hung-Wei Tseng

Tags: Computer science, CUDA, Databases, Machine learning, Matrix multiplication, Neural networks, nVidia, nVidia GeForce RTX 3090

December 19, 2021 by hgpu

High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results

Navdeep Katel, Vivek Khandelwal, Uday Bondhugula

Tags: Code generation, Computer science, CUBLAS, CUDA, HPC, Matrix multiplication, nVidia, nVidia GeForce RTX 3090, Package

September 5, 2021 by hgpu

PoCL-R: A Scalable Low Latency Distributed OpenCL Runtime

Jan Solanti, Michal Babej, Julius Ikkala, Vinod Kumar Malamal Vadakital, Pekka Jääskeläinen

Tags: Computer science, GPU cluster, Heterogeneous systems, Matrix multiplication, nVidia, nVidia GeForce GTX 1060, nVidia GeForce RTX 2080 Ti, OpenCL, Package, Rendering, Tesla P100, Tesla V100

August 8, 2021 by hgpu

Optimization of Heterogeneous Parallel Computing Systems using Machine Learning

Devi Abhiseshu Adurti, Mohit Battu

Tags: Computer science, CUDA, Heterogeneous systems, Machine learning, Matrix multiplication, nVidia, nVidia GeForce GTX 1050 Ti, Thesis

July 4, 2021 by hgpu

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Xiaoyan Liu, Yi Liu, Ming Dun, Bohong Yin, Hailong Yang, Zhongzhi Luan, Depei Qian

Tags: Computer science, CUBLAS, CUDA, Linear Algebra, Matrix multiplication, nVidia, Performance, Sparse matrix, Tesla V100

March 28, 2021 by hgpu

Evaluating the Performance and Portability of Contemporary SYCL Implementations

Beau Johnston, Jeffrey S. Vetter, Josh Milthorpe

Tags: AMD Radeon VII, ATI, Benchmarking, Computer science, CUDA, Heterogeneous systems, HIP, Matrix multiplication, nVidia, OpenCL, Package, Performance, Tesla P100

November 29, 2020 by hgpu

OpenCL Performance on the Intel Heterogeneous Architecture Research Platform

Steven Harris, Roger D. Chamberlain, Christopher Gill

Tags: Computer science, FPGA, Heterogeneous systems, Matrix multiplication, OpenCL, Optimization, performance portability

October 25, 2020 by hgpu

Flexible Performant GEMM Kernels on GPUs

Thomas Faingnaert, Tim Besard, Bjorn De Sutter

Tags: Computer science, CUBLAS, CUDA, Julia, Machine learning, Mathematical Software, Matrix multiplication, Mixed precision, nVidia, nVidia GeForce RTX 2080 Ti, Package, Performance

October 4, 2020 by hgpu

Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

Orestis Zachariadis, Nitin Satpute, Juan Gómez-Luna, Joaquín Olivares

Tags: Algorithms, Computer science, CUDA, Matrix multiplication, Mixed precision, nVidia, nVidia GeForce RTX 2070, nVidia Titan RTX, Package, Performance, Sparse matrix

October 4, 2020 by hgpu

Heterogeneous parallel computing for image registration and linear algebra applications

Orestis Zachariadis

Tags: CUDA, Heterogeneous systems, Image processing, Image registration, Linear Algebra, Matrix multiplication, nVidia, nVidia GeForce GTX 1050, nVidia GeForce RTX 2070, Package, Sparse matrix, Thesis

August 9, 2020 by hgpu
