Distributed Training of Deep Neuronal Networks: Theoretical and Practical Limits of Parallel Scalability
Fraunhofer ITWM
arXiv:1609.06870 [cs.CV] (22 Sep 2016)
@article{keuper2016distributed,
  title={Distributed Training of Deep Neuronal Networks: Theoretical and Practical Limits of Parallel Scalability},
  author={Keuper, Janis},
  year={2016},
  month={sep},
  eprint={1609.06870},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
This paper presents a theoretical analysis and practical evaluation of the main bottlenecks that stand in the way of a scalable distributed solution for training Deep Neuronal Networks (DNNs). The results show that the current state-of-the-art approach, data-parallel Stochastic Gradient Descent (SGD), is quickly turning into a heavily communication-bound problem. In addition, the paper presents simple but fixed theoretical constraints that prevent effective scaling of DNN training beyond only a few dozen nodes, leading to poor scalability of DNN training in most practical scenarios.
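To make the communication-bound argument concrete, the sketch below models the per-iteration cost of data-parallel SGD as local compute plus a gradient all-reduce. It is not taken from the paper: the parameter count, single-node compute time, interconnect bandwidth, and latency are illustrative assumptions, and the all-reduce cost follows the standard ring-allreduce estimate. Even under these rough assumptions, communication quickly dominates as the node count grows.

```python
# Back-of-the-envelope model (illustrative, not from the paper) of why
# data-parallel SGD becomes communication bound as the node count grows.
# All constants below are assumptions chosen for illustration.

PARAMS = 60e6            # number of model parameters (assumed)
BYTES_PER_PARAM = 4      # float32 gradients
BANDWIDTH = 10e9         # interconnect bandwidth in bytes/s (assumed)
LATENCY = 5e-6           # per-message latency in seconds (assumed)
T_COMPUTE_1 = 0.5        # forward+backward time for the full batch on 1 node (assumed)


def iteration_time(nodes: int) -> tuple[float, float]:
    """Return (compute, communication) time per SGD iteration on `nodes` nodes."""
    # Strong scaling: the global batch is split, so local compute shrinks with nodes.
    compute = T_COMPUTE_1 / nodes
    # Ring all-reduce moves roughly 2 * (n-1)/n of the gradient volume per node.
    volume = 2 * (nodes - 1) / nodes * PARAMS * BYTES_PER_PARAM
    comm = 2 * (nodes - 1) * LATENCY + volume / BANDWIDTH
    return compute, comm


if __name__ == "__main__":
    t1, _ = iteration_time(1)
    for n in (1, 2, 4, 8, 16, 32, 64, 128):
        comp, comm = iteration_time(n)
        speedup = t1 / (comp + comm)
        print(f"{n:4d} nodes: compute {comp*1e3:7.2f} ms, "
              f"comm {comm*1e3:7.2f} ms, speedup {speedup:5.1f}x")
```

Under these assumed numbers, local compute per iteration shrinks linearly while the all-reduce cost stays roughly constant, so the speedup flattens out after a few dozen nodes, which is the qualitative behaviour the paper analyzes.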
September 30, 2016 by hgpu