high performance computing on graphics processing units: hgpu.org

GPU Tensor Cores for fast Arithmetic Reductions

Cristóbal A. Navarro, Roberto Carrasco, Ricardo J. Barrientos, Javier A. Riquelme, Raimundo Vega

Institute of Informatics of Universidad Austral de Chile

arXiv:2001.05585 [cs.DC], (15 Jan 2020)

This work proposes a GPU tensor core approach that encodes the arithmetic reduction of n numbers as a set of chained m×m matrix multiply accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is T(n) = 5 log_{m^2} n, and its speedup is S = (4/5) log_2 m^2 over the classic O(n log n) parallel reduction algorithm. Experimental performance results show that the proposed reduction method is ~3.2x faster than a conventional GPU reduction implementation, and that it preserves numerical precision because the sub-results of each chain of R MMAs are kept as 32-bit floating point values before all being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks: small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of R=4 or R=5 MMAs per block, while large thread-blocks work best with R=1. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
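To make the encoding described in the abstract concrete, the following is a minimal CPU-side sketch in plain Python, not the authors' CUDA kernel. It assumes one particular way of expressing a sum as MMAs: multiplying a tile by an all-ones matrix accumulates column sums, and a final multiplication by ones collapses them into a total. The names `mma`, `chained_mma_sum`, and `speedup` are hypothetical, the values `m=4` and `r=2` are arbitrary illustration choices, and the sketch uses ordinary Python floats, so it does not model the 16-bit inputs or 32-bit accumulators of real tensor cores.

```python
import math


def mma(a, b, c):
    """Return a @ b + c for square m x m matrices given as lists of lists."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) + c[i][j]
             for j in range(m)] for i in range(m)]


def chained_mma_sum(values, m=4, r=2):
    """Sum `values` using only m x m MMA steps, in chains of r tiles.

    Illustrative assumption: each chain accumulates r tiles of m*m values
    into one accumulator matrix, then collapses it with one more MMA.
    """
    values = list(values)
    if not values:
        return 0.0
    ones = [[1.0] * m for _ in range(m)]
    zeros = [[0.0] * m for _ in range(m)]
    tile = m * m
    # Pad with zeros so the input splits evenly into chains of r tiles.
    padded = values + [0.0] * (-len(values) % (tile * r))
    partials = []
    for start in range(0, len(padded), tile * r):
        acc = zeros
        for t in range(r):
            off = start + t * tile
            x = [padded[off + i * m: off + (i + 1) * m] for i in range(m)]
            # ones @ x puts the column sums of x in every row; adding acc
            # chains the running column sums across the r tiles.
            acc = mma(ones, x, acc)
        # acc @ ones sums each row, so every entry is the chain's total.
        total = mma(acc, ones, zeros)
        partials.append(total[0][0])
    if len(partials) == 1:
        return partials[0]
    # Reduce the per-chain partial sums with the same scheme.
    return chained_mma_sum(partials, m, r)


def speedup(m):
    """Theoretical speedup S = (4/5) * log2(m^2) stated in the abstract."""
    return 0.8 * math.log2(m * m)


result = chained_mma_sum([float(v) for v in range(1, 101)])
```

Here `result` is the sum of 1 through 100 computed purely through MMA steps, and `speedup(16)` evaluates the abstract's formula for 16×16 tiles. The sketch keeps each chain's partial sum separate before the final reduction, mirroring the abstract's description of per-chain sub-results, but thread-block layout and parallel execution are outside its scope.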

Tags: Algorithms, Computer science, CUDA, Deep learning, Machine learning, nVidia, Tesla V100, TPU

January 19, 2020 by hgpu
