Analyzing GPU Tensor Core Potential for Fast Reductions
Instituto de Informatica, Universidad Austral de Chile, Valdivia, Chile
arXiv:1903.03640 [cs.DC] (8 Mar 2019)
DOI:10.29007/zlmg
@article{Carrasco_Cavieres_2018,
title={Analyzing GPU Tensor Core Potential for Fast Reductions},
ISSN={2516-2314},
url={http://dx.doi.org/10.29007/zlmg},
DOI={10.29007/zlmg},
journal={EasyChair Preprints},
publisher={EasyChair},
author={Carrasco Cavieres, Roberto A. and Vega, Raimundo and Navarro, Cristobal A.},
year={2018},
month={Oct}
}
The Nvidia GPU architecture has introduced new computing elements such as the tensor cores, which are special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations and accelerating Deep Learning applications. In this work we present the idea of using tensor cores for a different purpose: the parallel arithmetic reduction problem. We propose a new GPU tensor-core based algorithm and analyze its potential performance benefits in comparison to a traditional GPU-based one. The proposed method encodes the reduction of n numbers as a set of m×m MMA tensor-core operations (for Nvidia’s Volta architecture m=16) and takes advantage of the fact that each MMA operation takes just one GPU cycle. When analyzing the cost under a simplified GPU computing model, the result is that the new algorithm manages to reduce a problem of n numbers in T(n) = 5 * log_{m^2}(n) steps, with a speedup of S = (4/5) * log_2(m^2).
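To illustrate the idea of encoding a reduction as MMA operations, the following is a minimal CUDA sketch (not the paper's exact kernel; kernel and variable names are illustrative). One warp loads a 16×16 tile V of 256 half-precision values and multiplies it by an all-ones matrix A with a single tensor-core MMA, so every row of D = A*V holds the per-column sums of V. Here a short loop adds those 16 partial sums, whereas the paper's algorithm would instead chain further MMA operations (e.g. D*ones) to stay on the tensor cores.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp (32 threads) per block; each block reduces one 16x16 tile.
__global__ void tile_reduce(const half *v, float *partial) {
    __shared__ float d[16 * 16];

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_tile;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(a_ones, __float2half(1.0f));  // A = all ones
    wmma::fill_fragment(acc, 0.0f);

    // Load this block's 16x16 tile (256 values) of the input.
    wmma::load_matrix_sync(b_tile, v + blockIdx.x * 256, 16);

    // D = A*V + 0 : each row of D is the vector of column sums of V.
    wmma::mma_sync(acc, a_ones, b_tile, acc);
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
    __syncthreads();

    if (threadIdx.x == 0) {              // add the 16 column sums of row 0
        float s = 0.0f;
        for (int j = 0; j < 16; ++j) s += d[j];
        partial[blockIdx.x] = s;         // one partial sum per tile
    }
}

Compiled for sm_70 or newer and launched with 32 threads per block, each block emits one partial sum; repeating the scheme over the partial sums (or accumulating several tiles into the same accumulator fragment before the final row sum) yields the full reduction of n numbers.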
March 17, 2019 by hgpu