high performance computing on graphics processing units: hgpu.org

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

Hossein Albakri, Kazem Cheshmi

Tags: Compilers, Computer science, CUBLAS, CUDA, Machine learning, Matrix multiplication, Neural networks, nVidia, nVidia A100, Programming Languages, Sparse matrix

June 22, 2025 by hgpu

Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

L.A. Torres, Carlos J. Barrios H, Yves Denneulin

Tags: Computer science, CUBLAS, CUDA, Linear Algebra, Matrix multiplication, Neural networks, nVidia, nVidia A100, Package, Performance, SYCL

June 2, 2024 by hgpu

DGEMM on Integer Matrix Multiplication Unit

Hiroyuki Ootomo, Katsuhisa Ozaki, Rio Yokota

Tags: Computer science, CUBLAS, CUDA, Deep learning, Linear Algebra, Machine learning, Matrix multiplication, nVidia, nVidia A100, nVidia Jetson AGX Orin, nVidia RTX 6000 Ada, nVidia Titan RTX, Package

June 25, 2023 by hgpu

Performance study on GPU offloading techniques using the Gauss matrix inverse algorithm

Yannik Könneker

Tags: Computer science, CUBLAS, CUDA, Linear Algebra, nVidia, nVidia GeForce GTX 1080, OpenACC, OpenCL, Package, Performance, Tesla V100, Thesis

April 17, 2022 by hgpu

High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results

Navdeep Katel, Vivek Khandelwal, Uday Bondhugula

Tags: Code generation, Computer science, CUBLAS, CUDA, HPC, Matrix multiplication, nVidia, nVidia GeForce RTX 3090, Package

September 5, 2021 by hgpu

CoCoNet: Co-Optimizing Computation and Communication for Distributed Machine Learning

Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, Olli Sarikivi

Tags: Code generation, Computer science, CUBLAS, CUDA, Deep learning, Machine learning, nVidia, nVidia DGX-2

May 23, 2021 by hgpu

SLATE port to AMD and Intel platforms

Ahmad Abdelfattah, Mohammed Al Farhan, Cade Brown, Mark Gates, Dalal Sukkari, Asim YarKhan, Jack Dongarra

Tags: Computer science, CUBLAS, CUDA, Linear Algebra, nVidia, OpenCL, Package, SYCL

April 18, 2021 by hgpu

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Xiaoyan Liu, Yi Liu, Ming Dun, Bohong Yin, Hailong Yang, Zhongzhi Luan, Depei Qian

Tags: Computer science, CUBLAS, CUDA, Linear Algebra, Matrix multiplication, nVidia, Performance, Sparse matrix, Tesla V100

March 28, 2021 by hgpu

Flexible Performant GEMM Kernels on GPUs

Thomas Faingnaert, Tim Besard, Bjorn De Sutter

Tags: Computer science, CUBLAS, CUDA, Julia, Machine learning, Mathematical Software, Matrix multiplication, Mixed precision, nVidia, nVidia GeForce RTX 2080 Ti, Package, Performance

October 4, 2020 by hgpu

Automatic Kernel Generation for Volta Tensor Cores

Somashekaracharya G. Bhaskaracharya, Julien Demouth, Vinod Grover

Tags: Compilers, Computer science, CUBLAS, CUDA, Deep learning, Matrix multiplication, nVidia, nVidia Quadro GV100, Performance, Programming Languages, PTX

June 28, 2020 by hgpu

Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format

Shaohuai Shi, Qiang Wang, Xiaowen Chu

Tags: Algorithms, Computer science, CUBLAS, CUDA, Matrix multiplication, nVidia, nVidia GeForce GTX 980, nVidia GeForce GTX Titan X, Package, Sparse matrix, Tesla P100

June 7, 2020 by hgpu

Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs

Bastian Hagedorn, Archibald Samuel Elliott, Henrik Barthels, Rastislav Bodik, Vinod Grover

Tags: Computer science, CUBLAS, CUDA, Linear Algebra, Matrix multiplication, nVidia, nVidia GeForce GTX 750 Ti, nVidia GeForce RTX 2080 Ti, nVidia Quadro GV100, Programming Languages

March 22, 2020 by hgpu

Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication

exa-AMD: Exascale Accelerated Materials Discovery

Accelerated discovery and design of Fe-Co-Zr magnets with tunable magnetic anisotropy through machine learning and parallel computing

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

No More Shading Languages: Compiling C++ to Vulkan Shaders

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs
