high performance computing on graphics processing units: hgpu.org

Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers

Azzam Haidar, Panruo Wu, Stanimire Tomov, Jack Dongarra

University of Tennessee, Knoxville, Knoxville, TN

8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA17), 2017

@article{haidar2017investigating,
  title={Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers},
  author={Haidar, Azzam and Wu, Panruo and Tomov, Stanimire and Dongarra, Jack},
  year={2017}
}

The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 lies in the high performance it can achieve on today's powerful manycore GPU accelerators, such as the NVIDIA V100, which can provide 120 TeraFLOPS in FP16 alone. We present an investigation showing that other HPC applications can harness this power too, in particular the general HPC problem of solving Ax = b, where A is a large dense matrix and the solution is needed in FP32 or FP64 accuracy. Our approach is based on the mixed-precision iterative refinement technique: we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations that resolve the main computational challenges of efficiently parallelizing, scaling, and using FP16 arithmetic on high-end GPUs. We then show, for the first time, how the use of FP16 arithmetic can significantly accelerate FP32- or FP64-precision Ax = b solvers and make them more energy efficient. Our results are reproducible, and the developments will be made available through the MAGMA library. We quantify in practice the performance and the limitations of the approach.
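To make the idea of mixed-precision iterative refinement concrete, here is a minimal sketch in pure Python. It is not the authors' MAGMA implementation and makes no use of GPUs: it assumes a small, nonsingular, well-scaled matrix, emulates FP16 by rounding each stored value through the `struct` format `e`, and uses Python's native FP64 floats for the residual and the solution update. The names `fp16`, `lu_fp16`, `lu_solve`, and `solve_mixed` are hypothetical and chosen only for this illustration.

```python
import struct


def fp16(x):
    """Round an FP64 value to the nearest IEEE half-precision value.

    Assumption: |x| stays below the FP16 maximum (65504); larger values
    make struct.pack raise OverflowError.
    """
    return struct.unpack("e", struct.pack("e", x))[0]


def lu_fp16(A):
    """LU factorization with partial pivoting, rounding every result to FP16."""
    n = len(A)
    LU = [[fp16(v) for v in row] for row in A]
    perm = list(range(n))  # perm[k] = original index of the row now at position k
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = fp16(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = fp16(LU[i][j] - fp16(LU[i][k] * LU[k][j]))
    return LU, perm


def lu_solve(LU, perm, b):
    """Solve with the stored factors; the substitutions run in FP64."""
    n = len(b)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):  # forward substitution, L has a unit diagonal
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y


def solve_mixed(A, b, tol=1e-12, max_iter=50):
    """Factor once in emulated FP16, then refine the solution in FP64."""
    LU, perm = lu_fp16(A)
    x = lu_solve(LU, perm, b)
    b_norm = max(abs(v) for v in b)
    for it in range(max_iter):
        # The residual r = b - A x is computed with the original FP64 matrix.
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        if max(abs(v) for v in r) <= tol * b_norm:
            return x, it
        c = lu_solve(LU, perm, r)  # correction reuses the cheap FP16 factors
        x = [xi + ci for xi, ci in zip(x, c)]
    return x, max_iter


A = [[4.1, 1.3, 2.2], [1.7, 5.3, 1.1], [2.9, 1.2, 6.7]]
x_true = [1.0, -2.0, 3.0]
b = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]
x, iterations = solve_mixed(A, b)
print(x, iterations)
```

The design point the sketch tries to show is that the expensive O(n^3) factorization is done once in low precision, while each refinement step costs only O(n^2) work in high precision. Whether the iteration converges depends on the conditioning of A; the well-conditioned example matrix here is an assumption of the sketch, not a property claimed by the paper.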

Tags: Algorithms, Artificial intelligence, BLAS, Computer science, Linear Algebra, Mixed precision, Neural networks, nVidia, Tesla P100

December 10, 2017 by hgpu
