Exascale Deep Learning for Scientific Inverse Problems
Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
arXiv:1909.11150 [cs.LG] (24 Sep 2019)
@misc{laanait2019exascale,
title={Exascale Deep Learning for Scientific Inverse Problems},
author={Laanait, Nouamane and Romero, Joshua and Yin, Junqi and Young, M. Todd and Treichler, Sean and Starchenko, Vitalii and Borisevich, Albina and Sergeev, Alex and Matheson, Michael},
year={2019},
eprint={1909.11150},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
We introduce novel communication strategies for synchronous distributed deep learning, consisting of decentralized gradient reduction orchestration and computational graph-aware grouping of gradient tensors. These techniques produce an optimal overlap between computation and communication and result in near-linear scaling (0.93) of distributed training up to 27,600 NVIDIA V100 GPUs on the Summit supercomputer. We demonstrate our gradient reduction techniques in the context of training a fully convolutional neural network to approximate the solution of a longstanding scientific inverse problem in materials imaging. Efficient distributed training on a 0.5 PB dataset produces a model capable of atomically accurate reconstruction of materials, reaching a peak performance of 2.15(4) EFLOPS_16 in the process.
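To make the idea of grouping gradient tensors concrete, here is a minimal sketch in Python. It is not the authors' graph-aware algorithm: it assumes a plain size-threshold heuristic, where gradients are packed, in the order the backward pass produces them, into buckets that would each be reduced with a single collective call. The function name `group_gradients`, the layer names, and the byte sizes are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def group_gradients(
    grads: List[Tuple[str, int]], bucket_bytes: int
) -> List[List[str]]:
    """Greedily pack gradients into buckets of at least `bucket_bytes`.

    `grads` lists (name, size_in_bytes) pairs in the order the backward
    pass makes them available. Each returned bucket stands for one fused
    reduction, so fewer, larger messages are sent.
    """
    buckets: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, nbytes in grads:
        current.append(name)
        current_bytes += nbytes
        # Close the bucket once it is large enough to be worth sending.
        if current_bytes >= bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
    # Flush whatever is left when the backward pass finishes.
    if current:
        buckets.append(current)
    return buckets


# Hypothetical layers and sizes; the last layer's gradient arrives first.
example = [("conv4", 400), ("conv3", 300), ("conv2", 500), ("conv1", 100)]
print(group_gradients(example, bucket_bytes=600))
```

With these made-up sizes the four gradients fall into two buckets, so two reductions are issued instead of four, and the first can start while gradients for the earlier layers are still being computed. That is the overlap between computation and communication the abstract refers to, shown here only in schematic form.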
September 29, 2019 by hgpu