MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster
Emerging Technology Center, Midea Corporate Research Center, San Jose, CA, USA
arXiv:1802.02326 [cs.CV], 7 Feb 2018
@article{chen2018mimatrix,
title={MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster},
author={Chen, Xin and Zhou, Hua and Gao, Yuxiang and Zhu, Yu and Wang, Dongyan},
year={2018},
month={feb},
eprint={1802.02326},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
In this paper, we present a co-designed petascale high-density GPU cluster to expedite distributed deep learning training with synchronous Stochastic Gradient Descent (SSGD). The architecture of our heterogeneous cluster is inspired by the Harvard architecture: nodes are configured with different specifications according to their roles in the system. Based on the topology of the whole system's network and the properties of the different node types, we develop and implement a novel job server parallel software framework, named MiMatrix, for distributed deep learning training. Compared with the parameter server framework, in which the parameter server is a data-transfer bottleneck in the AllReduce step of SSGD, the job server undertakes all controlling, scheduling and monitoring tasks without transferring model data. In MiMatrix, we propose a novel GPUDirect Remote Direct Memory Access (RDMA)-aware parallel AllReduce algorithm executed by the computing servers, in which both the computation and the handshake messages are O(1) at each epoch.
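The listing contains no code, but the gradient-aggregation pattern the abstract refers to can be sketched briefly. The Python/NumPy snippet below is not MiMatrix: it does not use GPUDirect RDMA, the job server, or the authors' specific AllReduce schedule (which may differ). It only simulates, in a single process, how a ring-style AllReduce (reduce-scatter followed by all-gather) sums per-worker gradients so that every worker finishes an SSGD step with the same aggregated gradient, with per-worker traffic independent of the number of workers. All names here (ring_allreduce, world_size, etc.) are illustrative, not from the paper.

import numpy as np

def ring_allreduce(grads):
    # Sum-reduce equal-length gradient vectors (one per worker) using the
    # ring schedule: a reduce-scatter phase followed by an all-gather phase.
    n = len(grads)
    # Split each worker's gradient into n contiguous segments.
    segs = [list(np.array_split(g.astype(np.float64), n)) for g in grads]

    # Reduce-scatter: after n-1 steps, worker i holds the complete sum of
    # segment (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            s = (i - step) % n          # segment worker i forwards this step
            dst = (i + 1) % n           # ring neighbour
            segs[dst][s] = segs[dst][s] + segs[i][s]

    # All-gather: after n-1 more steps, every worker holds every reduced segment.
    for step in range(n - 1):
        for i in range(n):
            s = (i + 1 - step) % n      # fully reduced segment being circulated
            dst = (i + 1) % n
            segs[dst][s] = segs[i][s].copy()

    return [np.concatenate(worker_segs) for worker_segs in segs]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    world_size, dim = 4, 10
    grads = [rng.standard_normal(dim) for _ in range(world_size)]

    reduced = ring_allreduce(grads)
    expected = np.sum(grads, axis=0)
    assert all(np.allclose(r, expected) for r in reduced)

    # Synchronous SGD: every worker then applies the identical averaged gradient.
    avg_grad = reduced[0] / world_size
    print("allreduce matches direct sum:", np.allclose(reduced[0], expected))

In an actual distributed run the two inner loops would be replaced by point-to-point transfers between neighbouring computing servers (e.g. over RDMA), so each worker sends and receives roughly one gradient's worth of data per iteration regardless of cluster size.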
February 9, 2018 by hgpu