high performance computing on graphics processing units: hgpu.org

Mixed precision

Mixed-precision finite element kernels and assembly: Rounding error analysis and hardware acceleration

Matteo Croci, Garth N. Wells

Tags: AVX, Computer science, Finite element method, Floating point error, Intel, Matrix multiplication, Mixed precision, Package

October 27, 2024 by hgpu

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

Taesu Kim, Jongho Lee, Daehyun Ahn, Sarang Kim, Jiwoong Choi, Minkyu Kim, Hyungjun Kim

Tags: Computer science, CUDA, Deep learning, Machine learning, Matrix multiplication, Mixed precision, nVidia, nVidia A100, nVidia GeForce RTX 4090, nVidia RTX A6000, Package

February 18, 2024 by hgpu

Memory Efficient Mixed-Precision Optimizers

Basile Lewandowski, Atli Kosson

Tags: Computer science, CUDA, Machine learning, Mixed precision, nVidia, nVidia A100, nVidia V100

October 1, 2023 by hgpu

Improving Performance of Iterative Applications through Interleaved Execution of Approximated CUDA Kernels

Gabriel Freytag

Tags: Computer science, CUDA, Machine learning, Mixed precision, Neural networks, nVidia, nVidia A100, nVidia P100, Thesis

June 18, 2023 by hgpu

Efficient Quantized Sparse Matrix Operations on Tensor Cores

Shigang Li, Kazuki Osawa, Torsten Hoefler

Tags: Computer science, CUDA, Deep learning, Linear Algebra, Mixed precision, nVidia, nVidia A100, Package, Sparse matrix, Tesla V100

October 2, 2022 by hgpu

On the accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit and novel 16-bit number formats

Moritz Lehmann, Mathias J. Krause, Giorgio Amati, Marcello Sega, Jens Harting, Stephan Gekle

Tags: AMD Radeon Instinct MI100, AMD Radeon VI, ATI, Fluid dynamics, lattice Boltzmann, Mixed precision, nVidia, OpenCL, Tesla K20, Tesla K40, Tesla K80, Tesla P100, Tesla V100

December 19, 2021 by hgpu

Mixed precision in Graphics Processing Unit

Quentin Gallouédec

Tags: Computer science, Machine learning, Mixed precision, Neural networks, nVidia, Tesla V100

October 31, 2021 by hgpu

A Study of Mixed Precision Strategies for GMRES on GPUs

Jennifer A. Loe, Christian A. Glusa, Ichitaro Yamazaki, Erik G. Boman, Sivasankaran Rajamanickam

Tags: Algorithms, Computer science, CUDA, Linear Algebra, Mixed precision, nVidia, performance portability, Tesla V100

September 19, 2021 by hgpu

tcFFT: Accelerating Half-Precision FFT through Tensor Cores

Binrui Li, Shenggan Cheng, James Lin

Tags: Computer science, FFT, Mixed precision, nVidia, nVidia DGX-2, nVidia DGX-A100, Tesla V100

May 2, 2021 by hgpu

Mixed-Precision Embedding Using a Cache

Jie (Amy) Yang, Jianyu Huang, Jongsoo Park, Ping Tak Peter Tang, Andrew Tulloch

Tags: Artificial intelligence, Computer science, CUDA, Machine learning, Mixed precision, nVidia, Tesla V100

October 25, 2020 by hgpu

Flexible Performant GEMM Kernels on GPUs

Thomas Faingnaert, Tim Besard, Bjorn De Sutter

Tags: Computer science, CUBLAS, CUDA, Julia, Machine learning, Mathematical Software, Matrix multiplication, Mixed precision, nVidia, nVidia GeForce RTX 2080 Ti, Package, Performance

October 4, 2020 by hgpu

Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

Orestis Zachariadis, Nitin Satpute, Juan Gómez-Luna, Joaquín Olivares

Tags: Algorithms, Computer science, CUDA, Matrix multiplication, Mixed precision, nVidia, nVidia GeForce RTX 2070, nVidia Titan RTX, Package, Performance, Sparse matrix

October 4, 2020 by hgpu
