high performance computing on graphics processing units: hgpu.org

hgpu.org » BLAS

Reproducible Study and Performance Analysis of GPU Programming Paradigms: OpenACC vs. CUDA in Key Linear Algebra Computations

Ezhilmathi Krishnasamy, Pascal Bouvry

Tags: BLAS, Computer science, CUDA, Differential equations, HPC, Linear Algebra, Matrix multiplication, nVidia, nVidia A100, OpenACC, Package, Partial differential equations, PDEs, Performance

December 24, 2024 by hgpu

Fast and Practical Strassen’s Matrix Multiplication using FPGAs

Afzal Ahmad, Linfeng Du, Wei Zhang

Tags: BLAS, Computer science, FPGA, GEMM, Linear Algebra, Machine learning, Matrix multiplication, OpenCL, Package

June 9, 2024 by hgpu

Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper

Junjie Li, Yinzhi Wang, Xiao Liang, Hang Liu

Tags: BLAS, Chemistry, CUDA, nVidia, nVidia GH200, nVidia H100, Performance, Physics

May 5, 2024 by hgpu

Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU

Jou-An Chen, Hsin-Hsuan Sung, Nathan Tallent, Kevin Barker, Xipeng Shen, Ang Li

Tags: BLAS, Computer science, CUDA, Linear Algebra, nVidia, nVidia GeForce GTX 1080, nVidia GeForce GTX Titan V, Package, Sparse

January 30, 2022 by hgpu

Parallel time integration using Batched BLAS (Basic Linear Algebra Subprograms) routines

Konstantin Herb, Pol Welter

Tags: BLAS, Computational Physics, Computer science, CUDA, Numerical Analysis, nVidia, Package, Tesla V100

August 22, 2021 by hgpu

FBLAS: Streaming Linear Algebra Kernels on FPGA

Tiziano De Matteis, Johannes de Fine Licht, Torsten Hoefler

Tags: BLAS, Computer science, FPGA, Linear Algebra, OpenCL, Package

December 1, 2019 by hgpu

Out-of-core singular value decomposition

Vadim Demchik, Miroslav Bačák, Stefan Bordag

Tags: BLAS, Computer science, Mathematical Software, Numerical Analysis, Out of core, Sparse matrix

July 16, 2019 by hgpu

Cross-Platform Performance Portability Using Highly Parametrized SYCL Kernels

John Lawson, Mehdi Goli, Duncan McBain, Daniel Soutar, Louis Sugy

Tags: AMD R9 Nano, ATI, BLAS, Computer science, Deep learning, Linear Algebra, Machine learning, Mathematical Software, OpenCL, Package, Performance, performance portability, SYCL

April 14, 2019 by hgpu

Implementing Push-Pull Efficiently in GraphBLAS

Carl Yang, Aydin Buluc, John D. Owens

Tags: Algorithm optimization, Algorithms, BLAS, Computer science, CUDA, Graph theory, Linear Algebra, nVidia, Sparse matrix, Tesla K40

April 15, 2018 by hgpu

Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers

Azzam Haidar, Panruo Wu, Stanimire Tomov, Jack Dongarra

Tags: Algorithms, Artificial intelligence, BLAS, Computer science, Linear Algebra, Mixed precision, Neural networks, nVidia, Tesla P100

December 10, 2017 by hgpu

Out-of-core Implementation for Accelerator Kernels on Heterogeneous Clouds

Hamidreza Khaleghzadeh, Ziming Zhong, Ravi Reddy, Alexey Lastovetsky

Tags: BLAS, Cloud, Computer science, CUBLAS, CUDA, FPGA, Heterogeneous systems, Intel Xeon Phi, Matrix multiplication, nVidia, OpenCL, Package, Virtualization

September 16, 2017 by hgpu

CLBlast: A Tuned OpenCL BLAS Library

Cedric Nugteren

Tags: AMD Radeon R9 M370X, ARM, ATI, BLAS, Computer science, Intel HD 5100, Linear Algebra, Machine learning, nVidia, nVidia GeForce GTX 750 Ti, nVidia GeForce GTX Titan X, OpenCL, Package

May 18, 2017 by hgpu
