high performance computing on graphics processing units: hgpu.org

Improving Numerical Accuracy for Non-Negative Matrix Multiplication on GPUs using Recursive Algorithms

Matthew Badin, Paolo D’Alberto, Lubomir Bic, Michael Dillencourt, Alexandru Nicolau

University of California Irvine, Irvine, CA 92697

International Conference on Supercomputing (ICS), 2013

Scientific computing is bounded only by the limits of Moore's Law and the scalability of high-performance mathematical library implementations. Most mathematical libraries, however, target general inputs and do not tailor their implementations to specific classes of input, such as non-negative inputs, which limits their potential performance and scalability. Removing this limitation makes it possible to improve both the performance and the accuracy of a range of problems. In this paper we explore how hardware limits the accuracy of non-negative matrix multiplication by comparing implementations on the GPU and the CPU, and we propose algorithmic solutions that improve accuracy. Next, we demonstrate a matrix multiplication implementation that takes advantage of asymptotically fast matrix multiplication algorithms, which have been shown to scale better than O(N^3) implementations; it improves accuracy by up to a whole digit while increasing performance by up to 27% for matrices whose inputs are positive. Finally, we propose extending the BLAS Level 3 specification to non-negative matrices, so that our solution can be integrated easily and other library authors can implement their own solutions as part of an existing standard.
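
The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea that evaluation order affects rounding error when every term is non-negative. It is not the authors' implementation. It compares a left-to-right single-precision dot product (one entry of a matrix product) with a recursive, pairwise one. The helper names `f32`, `dot_sequential`, `dot_recursive`, and `relative_errors` are hypothetical, and single precision is emulated on the CPU with Python's `struct` module rather than run on a GPU.

```python
import math
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_sequential(a, b):
    """Left-to-right dot product, rounding to single precision after each step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc


def dot_recursive(a, b, lo=0, hi=None):
    """Pairwise dot product: split the range in half, recurse, add the halves."""
    if hi is None:
        hi = len(a)
    n = hi - lo
    if n == 0:
        return 0.0
    if n == 1:
        return f32(a[lo] * b[lo])
    mid = lo + n // 2
    left = dot_recursive(a, b, lo, mid)
    right = dot_recursive(a, b, mid, hi)
    return f32(left + right)


def relative_errors(n):
    """Return (sequential_error, recursive_error) for a non-negative test vector."""
    # Assumed test data: every term is the single-precision value nearest 0.1.
    a = [f32(0.1)] * n
    b = [1.0] * n
    # Reference value: correctly rounded double-precision sum of the products.
    exact = math.fsum(x * y for x, y in zip(a, b))
    seq_err = abs(dot_sequential(a, b) - exact) / exact
    rec_err = abs(dot_recursive(a, b) - exact) / exact
    return seq_err, rec_err


if __name__ == "__main__":
    seq_err, rec_err = relative_errors(1 << 16)
    print("sequential relative error:", seq_err)
    print("recursive relative error: ", rec_err)
```

With non-negative terms the running sum only grows, so a left-to-right accumulation keeps adding small terms to an ever larger partial sum and loses low-order bits at each step. The recursive version adds partial sums of similar magnitude, which is one reason a recursive formulation can help accuracy. The test vector here is deliberately simple and favourable to the pairwise order; real matrices will show smaller differences.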

Tags: Algorithms, BLAS, Computer science, CUDA, Matrix multiplication, nVidia, Performance, Tesla C2070

April 30, 2013 by hgpu
