Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures

Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk

INRIA – Centre de recherche Rennes – Bretagne Atlantique, Campus de Beaulieu, F-35042 Rennes Cedex, France

hal-00949355, (25 February 2014)

@techreport{collange:hal-00949355,

hal_id={hal-00949355},

url={http://hal.archives-ouvertes.fr/hal-00949355},

title={Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures},

author={Collange, Sylvain and Defour, David and Graillat, Stef and Iakymchuk, Roman},

keywords={Parallel floating-point summation, reproducibility, accuracy, long accumulator, multi-precision, multi- and many-core architectures.},

language={English},

affiliation={ALF – INRIA – IRISA, Laboratoire d’Informatique de Robotique et de Micro{\'e}lectronique de Montpellier – LIRMM, Digits, Architectures et Logiciels Informatiques – DALI, Laboratoire d’Informatique de Paris 6 – LIP6},

year={2014},

month={Feb},

pdf={http://hal.archives-ouvertes.fr/hal-00949355/PDF/superaccumulator.pdf}

}

On modern multi-core, many-core, and heterogeneous architectures, floating-point computations, especially reductions, may become non-deterministic and thus non-reproducible, mainly because floating-point operations are not associative. We introduce a solution to compute deterministic sums of floating-point numbers efficiently and with the best possible accuracy. Our multi-level algorithm consists of two main stages: a filtering stage that uses fast vectorized floating-point expansions, and an accumulation stage based on superaccumulators in a high-radix carry-save representation. We present implementations on recent Intel desktop and server processors, on the Intel Xeon Phi accelerator, and on AMD and NVIDIA GPUs. We show that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude by leveraging arithmetic units that are left underused by standard reduction algorithms.
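To make the problem and the two building blocks concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names (`naive_sum`, `two_sum`, `to_fixed`, `superacc_sum`) are hypothetical, the accumulator is a single arbitrary-precision integer rather than a high-radix carry-save array, and nothing is vectorized or parallel. It only illustrates why summation order changes a naive result, how an error-free transformation (the primitive behind floating-point expansions) captures the rounding error of one addition, and how a long fixed-point accumulator makes the result independent of order.

```python
import random

# Assumption: inputs are finite IEEE 754 binary64 values. Every such value
# is an integer multiple of 2**-1074, so it fits exactly on this grid.
SCALE = 1 << 1074


def naive_sum(xs):
    """Left-to-right floating-point sum; the result depends on the order."""
    total = 0.0
    for x in xs:
        total += x
    return total


def two_sum(a, b):
    """Knuth's error-free transformation.

    Returns (s, e) where s is the rounded sum fl(a + b) and e is the
    rounding error, so that a + b == s + e holds exactly (barring overflow).
    """
    s = a + b
    b_virtual = s - a
    e = (a - (s - b_virtual)) + (b - b_virtual)
    return s, e


def to_fixed(x):
    """Convert a finite float to an exact integer count of 2**-1074 units."""
    n, d = x.as_integer_ratio()  # exact; d is a power of two dividing SCALE
    return n * (SCALE // d)


def superacc_sum(xs):
    """Illustrative long accumulator: exact integer sum, one final rounding."""
    acc = 0
    for x in xs:
        acc += to_fixed(x)  # integer addition is exact and associative
    return acc / SCALE      # int / int is correctly rounded in Python


if __name__ == "__main__":
    a = [1e16, 1.0, -1e16]
    b = [1e16, -1e16, 1.0]  # same values, different order
    print(naive_sum(a), naive_sum(b))        # the two results differ
    print(superacc_sum(a), superacc_sum(b))  # the two results agree
    print(two_sum(1e16, 1.0))                # rounded sum and its lost part

    rng = random.Random(0)
    xs = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-40, 40)
          for _ in range(1000)]
    ys = xs[:]
    rng.shuffle(ys)
    print(superacc_sum(xs) == superacc_sum(ys))
```

Because integer addition is associative, any partition of the input across threads followed by a merge of the partial accumulators would give the same bits in this sketch. The cost of the wide accumulator is what the paper's filtering stage is described as reducing.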

Tags: Algorithms, ATI, ATI Radeon HD 7970, Computer science, Extended precision, Intel Xeon Phi, nVidia, OpenCL, Tesla K20

February 26, 2014 by hgpu
