Analyzing GPU Tensor Core Potential for Fast Reductions
Instituto de Informatica, Universidad Austral de Chile, Valdivia, Chile
arXiv:1903.03640 [cs.DC] (8 Mar 2019)
DOI:10.29007/zlmg
@article{Carrasco_Cavieres_2018,
title={Analyzing GPU Tensor Core Potential for Fast Reductions},
ISSN={2516-2314},
url={http://dx.doi.org/10.29007/zlmg},
DOI={10.29007/zlmg},
journal={EasyChair Preprints},
publisher={EasyChair},
author={Carrasco Cavieres, Roberto A. and Vega, Raimundo and Navarro, Cristobal A.},
year={2018},
month={Oct}
}
The Nvidia GPU architecture has introduced new computing elements such as the tensor cores, which are special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations and accelerating Deep Learning applications. In this work we present the idea of using tensor cores for a different purpose: the parallel arithmetic reduction problem. We propose a new GPU tensor-core based algorithm and analyze its potential performance benefits in comparison to a traditional GPU-based one. The proposed method encodes the reduction of n numbers as a set of m×m MMA tensor-core operations (for Nvidia’s Volta architecture m=16) and takes advantage of the fact that each MMA operation takes just one GPU cycle. When analyzing the cost under a simplified GPU computing model, the result is that the new algorithm manages to reduce a problem of n numbers in T(n) = 5 * log_{m^2}(n) steps, with a speedup of S = (4/5) * log_2(m^2).
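To illustrate the idea of encoding a reduction as MMA operations, the following is a minimal CUDA sketch (not the paper's exact kernel; kernel and variable names are illustrative). One warp loads a 16×16 tile V of 256 half-precision values and multiplies it by an all-ones matrix A with a single tensor-core MMA, so every row of D = A*V holds the per-column sums of V. Here a short loop adds those 16 partial sums, whereas the paper's algorithm would instead chain further MMA operations (e.g. D*ones) to stay on the tensor cores.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp (32 threads) per block; each block reduces one 16x16 tile.
__global__ void tile_reduce(const half *v, float *partial) {
    __shared__ float d[16 * 16];

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_tile;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(a_ones, __float2half(1.0f));  // A = all ones
    wmma::fill_fragment(acc, 0.0f);

    // Load this block's 16x16 tile (256 values) of the input.
    wmma::load_matrix_sync(b_tile, v + blockIdx.x * 256, 16);

    // D = A*V + 0 : each row of D is the vector of column sums of V.
    wmma::mma_sync(acc, a_ones, b_tile, acc);
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
    __syncthreads();

    if (threadIdx.x == 0) {              // add the 16 column sums of row 0
        float s = 0.0f;
        for (int j = 0; j < 16; ++j) s += d[j];
        partial[blockIdx.x] = s;         // one partial sum per tile
    }
}

Compiled for sm_70 or newer and launched with 32 threads per block, each block emits one partial sum; repeating the scheme over the partial sums (or accumulating several tiles into the same accumulator fragment before the final row sum) yields the full reduction of n numbers.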
March 17, 2019 by hgpu