high performance computing on graphics processing units: hgpu.org

Experiments on Parallel Training of Deep Neural Network using Model Averaging

Hang Su, Haoyu Chen

International Computer Science Institute, Berkeley, California, US

arXiv:1507.01239 [cs.LG], (5 Jul 2015)

In this work we apply model averaging to the parallel training of deep neural networks (DNNs). Data is partitioned and distributed to different nodes for local model updates, and the models are averaged across nodes every few minibatches. We use multiple GPUs for data parallelization and the Message Passing Interface (MPI) for communication between nodes, which allows us to perform model averaging frequently without losing much time on communication. We investigate the effectiveness of Natural Gradient Stochastic Gradient Descent (NG-SGD) and Restricted Boltzmann Machine (RBM) pretraining for parallel training in the model-averaging framework, and explore the best setups in terms of learning rate schedules, averaging frequencies, and minibatch sizes. We show that NG-SGD and RBM pretraining benefit parameter-averaging-based model training. On the 300h Switchboard dataset, we achieve a 9.3 times speedup using 16 GPUs and a 17 times speedup using 32 GPUs, with limited loss in decoding accuracy.
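The averaging scheme described in the abstract (partition the data, update a local model on each node, average the models every few minibatches) can be illustrated with a small single-process sketch. This is not the authors' implementation: the nodes are simulated in a loop rather than run on GPUs over MPI, the model is a noise-free linear regression rather than a DNN, and plain SGD stands in for NG-SGD. All function names and hyperparameter values here are hypothetical, chosen only for illustration.

```python
import random


def make_data(n, true_w, seed=0):
    """Generate n noise-free samples (x, y) with y = true_w . x (toy data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        y = sum(w * xi for w, xi in zip(true_w, x))
        data.append((x, y))
    return data


def sgd_step(w, batch, lr):
    """One minibatch SGD step on the mean squared error of a linear model."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(batch)
    return [wi - lr * g for wi, g in zip(w, grad)]


def average_models(models):
    """Element-wise mean of the parameter vectors held by all nodes."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def train_model_averaging(data, dim, num_nodes=4, avg_every=5,
                          batch_size=8, lr=0.5, epochs=30):
    """Simulate data-parallel training with periodic model averaging."""
    # Partition the data: each simulated node only sees its own shard.
    shards = [data[k::num_nodes] for k in range(num_nodes)]
    models = [[0.0] * dim for _ in range(num_nodes)]
    steps_per_epoch = min(len(s) for s in shards) // batch_size
    step = 0
    for _ in range(epochs):
        for b in range(steps_per_epoch):
            # Local update on every node (these would run in parallel).
            for k in range(num_nodes):
                batch = shards[k][b * batch_size:(b + 1) * batch_size]
                models[k] = sgd_step(models[k], batch, lr)
            step += 1
            # Every avg_every minibatches, replace each local model
            # with the average over all nodes.
            if step % avg_every == 0:
                avg = average_models(models)
                models = [list(avg) for _ in range(num_nodes)]
    return average_models(models)


if __name__ == "__main__":
    true_w = [1.0, -2.0, 0.5]
    data = make_data(256, true_w)
    print(train_model_averaging(data, dim=len(true_w)))
```

In this sketch `avg_every` plays the role of the averaging frequency and `batch_size` the minibatch size, two of the settings the abstract says were explored. In a real multi-node run the averaging step would be a collective communication operation rather than a loop over in-memory lists.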

Tags: Computer science, CUDA, Deep learning, MPI, Neural networks, nVidia, Tesla K20

July 8, 2015 by hgpu
