
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He
Facebook
arXiv:1706.02677 [cs.CV], 8 Jun 2017

@article{goyal2017accurate,
   title={Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour},
   author={Goyal, Priya and Dollar, Piotr and Girshick, Ross and Noordhuis, Pieter and Wesolowski, Lukasz and Kyrola, Aapo and Tulloch, Andrew and Jia, Yangqing and He, Kaiming},
   journal={arXiv preprint arXiv:1706.02677},
   year={2017},
   month={jun},
   eprint={1706.02677},
   archivePrefix={arXiv},
   primaryClass={cs.CV}
}

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. This system enables us to train visual recognition models on internet-scale data with high efficiency.
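The abstract describes two techniques: the linear scaling rule (multiply the learning rate by k when the minibatch size grows by a factor of k) and a gradual warmup that ramps the learning rate up over the first few epochs of training. The sketch below is not the authors' Caffe2 implementation; it is a minimal Python illustration of such a schedule, assuming the reference setting of a 0.1 learning rate for a 256-image minibatch and a 5-epoch warmup, with hypothetical function and parameter names.

```python
# Minimal sketch of a linearly scaled learning rate with gradual warmup.
# Assumptions (not taken from the paper's code): base_lr=0.1 at batch 256,
# warmup over the first 5 epochs; names are illustrative only.

def learning_rate(epoch, base_lr=0.1, base_batch=256, batch=8192,
                  warmup_epochs=5):
    """Return the learning rate at a given (possibly fractional) epoch."""
    k = batch / base_batch        # minibatch growth factor, e.g. 8192 / 256 = 32
    target_lr = base_lr * k       # linear scaling rule: eta -> k * eta
    if epoch < warmup_epochs:
        # Gradual warmup: interpolate linearly from base_lr to target_lr.
        return base_lr + (target_lr - base_lr) * (epoch / warmup_epochs)
    return target_lr

# Example: with a minibatch of 8192, the rate ramps from 0.1 to 3.2 over 5 epochs.
for e in [0, 1, 2.5, 5, 30]:
    print(e, learning_rate(e))
```

In this sketch the scaled rate (here 3.2) is reached only after the warmup period, which is the mechanism the paper credits with overcoming optimization difficulties early in large-minibatch training.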