GPU Tensor Cores for fast Arithmetic Reductions

Cristóbal A. Navarro, Roberto Carrasco, Ricardo J. Barrientos, Javier A. Riquelme, Raimundo Vega
Institute of Informatics, Universidad Austral de Chile
arXiv:2001.05585 [cs.DC] (15 Jan 2020)

@misc{navarro2020gpu,
   title={GPU Tensor Cores for fast Arithmetic Reductions},
   author={Cristóbal A. Navarro and Roberto Carrasco and Ricardo J. Barrientos and Javier A. Riquelme and Raimundo Vega},
   year={2020},
   eprint={2001.05585},
   archivePrefix={arXiv},
   primaryClass={cs.DC}
}

This work proposes a GPU tensor core approach that encodes the arithmetic reduction of n numbers as a set of chained m × m matrix multiply accumulate (MMA) operations executed in parallel by the GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is T(n) = 5 log_{m^2}(n), and its speedup over the classic O(n log n) parallel reduction algorithm is S = (4/5) log_2(m^2). Experimental results show that the proposed reduction method is ~3.2x faster than a conventional GPU reduction implementation and preserves numerical precision, because the sub-results of each chain of R MMAs are kept as 32-bit floating point values before being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks: small thread-blocks of 32 or 128 threads can still achieve maximum performance by using a chain of R = 4 or 5 MMAs per block, while large thread-blocks work best with R = 1. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine-Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
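The chained-MMA idea can be illustrated with a short CUDA sketch (not the authors' implementation; the kernel name, launch geometry, and the final per-row collapse are illustrative assumptions). Each warp multiplies an all-ones 16x16 matrix by R consecutive 16x16 data tiles, accumulating the column sums in a 32-bit accumulator fragment, and then sums the 16 column sums to obtain the block's partial result.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Hedged sketch: one warp per block reduces R tiles of 16x16 = 256 half
// values by chaining R MMAs of the form acc += Ones(16x16) * X_r(16x16).
// After the chain, every row of acc holds the same 16 column sums of the
// R tiles, kept in fp32; summing one row yields the block's partial sum.
__global__ void chained_mma_reduce(const half *data, float *out, int R) {
    const half *block_data = data + (size_t)blockIdx.x * R * 256;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> x;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(ones, __float2half(1.0f));
    wmma::fill_fragment(acc, 0.0f);

    // Chain of R MMAs: the fp32 accumulator carries the partial sums.
    for (int r = 0; r < R; ++r) {
        wmma::load_matrix_sync(x, block_data + r * 256, 16);
        wmma::mma_sync(acc, ones, x, acc);
    }

    // Collapse: write the accumulator to shared memory and sum row 0.
    __shared__ float tile[256];
    wmma::store_matrix_sync(tile, acc, 16, wmma::mem_row_major);
    __syncwarp();
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int j = 0; j < 16; ++j) s += tile[j];
        atomicAdd(out, s);  // combine the partial results of all blocks
    }
}
```

Under these assumptions, a reduction of n numbers (n a multiple of R*256, padded with zeros otherwise) would be launched as `chained_mma_reduce<<<n / (R * 256), 32>>>(d_data, d_out, R)` on a Volta-or-newer GPU (`-arch=sm_70`), with `d_out` initialized to zero.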
