high performance computing on graphics processing units: hgpu.org

Modeling Deep Learning Accelerator Enabled GPUs

Md Aamir Raihan, Negar Goli, Tor Aamodt

Electrical and Computer Engineering, University of British Columbia

arXiv:1811.08309 [cs.MS], (19 Nov 2018)

The efficacy of deep learning has made it one of the most important applications run in data centers today. The NVIDIA Tesla V100 GPU introduced a specialized functional unit called the Tensor Core to meet growing demand for higher performance on this workload. To exploit the full capability of current NVIDIA GPUs, machine learning researchers have started to use Tensor Cores. For example, five of the six 2018 Gordon Bell Award finalists used Tensor Cores in their work. However, no open-source GPU microarchitectural simulator currently models Tensor Cores. In this paper, we comprehensively investigate NVIDIA’s Tensor Core implementation found in the Volta and Turing architectures and propose an architectural model for it. Our Tensor Core timing model, implemented in GPGPU-Sim, achieves 99.6% IPC correlation versus a physical V100 GPU. Building upon this, we also enable GPGPU-Sim to run NVIDIA’s CUTLASS, an open-source CUDA C++ template library that provides customizable GEMM templates, including support for Tensor Cores.

Tags: Computer science, CUDA, Deep learning, Hardware Architecture, Machine learning, nVidia, Tesla V100

November 25, 2018 by hgpu
