MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster

Xin Chen, Hua Zhou, Yuxiang Gao, Yu Zhu, Dongyan Wang
Emerging Technology Center, Midea Corporate Research Center, San Jose, CA, USA
arXiv:1802.02326 [cs.CV] (7 Feb 2018)

@article{chen2018mimatrix,
   title={MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster},
   author={Chen, Xin and Zhou, Hua and Gao, Yuxiang and Zhu, Yu and Wang, Dongyan},
   year={2018},
   month={feb},
   eprint={1802.02326},
   archivePrefix={arXiv},
   primaryClass={cs.CV}
}


In this paper, we present a co-designed petascale high-density GPU cluster to expedite distributed deep learning training with synchronous Stochastic Gradient Descent (SSGD). The architecture of our heterogeneous cluster is inspired by the Harvard architecture. According to their different roles in the system, nodes are configured with different specifications. Based on the topology of the whole system's network and the properties of the different types of nodes, we develop and implement a novel job-server parallel software framework, named MiMatrix, for distributed deep learning training. Compared to the parameter server framework, in which the parameter server is a bottleneck of data transfer in the AllReduce algorithm of SSGD, the job server undertakes all controlling, scheduling and monitoring tasks without any model data transfer. In MiMatrix, we propose a novel GPUDirect Remote Direct Memory Access (RDMA)-aware parallel AllReduce algorithm executed by the computing servers, in which both the computation and the handshake messages are O(1) at each epoch.
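The listing does not reproduce the paper's RDMA-aware algorithm itself. As a rough point of reference only, the sketch below simulates a standard ring AllReduce in plain Python: the decentralized, peer-to-peer reduction that avoids routing gradients through a central parameter server. Every name in it (ring_allreduce, span, etc.) is hypothetical illustration, not MiMatrix code. Note that a ring schedule needs 2(n-1) communication steps per reduction, which is the kind of per-step handshake cost the paper's O(1)-per-epoch claim is contrasted against.

from typing import List

def ring_allreduce(grads: List[List[float]]) -> None:
    """In-place ring AllReduce: afterwards, every worker's list holds
    the element-wise sum of all workers' gradients."""
    n = len(grads)
    dim = len(grads[0])
    assert dim % n == 0, "toy version assumes dim divisible by n"
    chunk = dim // n

    def span(c):
        # Slice bounds for chunk index c.
        return c * chunk, (c + 1) * chunk

    # Phase 1, reduce-scatter: at step t, worker i forwards chunk
    # (i - t) mod n to its ring neighbor, which accumulates it. After
    # n-1 steps, worker i holds the fully reduced chunk (i + 1) mod n.
    for t in range(n - 1):
        for i in range(n):
            lo, hi = span((i - t) % n)
            nbr = (i + 1) % n
            for k in range(lo, hi):
                grads[nbr][k] += grads[i][k]

    # Phase 2, all-gather: the fully reduced chunks circulate around
    # the ring; each worker overwrites its stale copy with the reduced one.
    for t in range(n - 1):
        for i in range(n):
            lo, hi = span((i + 1 - t) % n)
            nbr = (i + 1) % n
            for k in range(lo, hi):
                grads[nbr][k] = grads[i][k]

# Example: 4 simulated workers, 8-dimensional gradients.
if __name__ == "__main__":
    g = [[float(w + 1)] * 8 for w in range(4)]
    ring_allreduce(g)
    print(g[0])  # [10.0]*8: the global sum 1+2+3+4 on every worker

In this simulation each worker contributes a constant gradient, so after the call all four workers hold the identical summed vector, the invariant any AllReduce must satisfy regardless of schedule.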