Scaling GRPC Tensorflow on 512 nodes of Cori Supercomputer

Amrita Mathuriya, Thorsten Kurth, Vivek Rane, Mustafa Mustafa, Lei Shao, Debbie Bard, Prabhat, Victor W Lee
Intel Corporation
arXiv:1712.09388 [cs.DC], (26 Dec 2017)

@article{mathuriya2017scaling,
   title={Scaling GRPC Tensorflow on 512 nodes of Cori Supercomputer},
   author={Mathuriya, Amrita and Kurth, Thorsten and Rane, Vivek and Mustafa, Mustafa and Shao, Lei and Bard, Debbie and Prabhat and Lee, Victor W},
   year={2017},
   month={dec},
   eprint={1712.09388},
   archivePrefix={arXiv},
   primaryClass={cs.DC}
}

We explore the scaling of standard distributed TensorFlow with GRPC primitives on up to 512 Intel Xeon Phi (KNL) nodes of the Cori supercomputer using synchronous stochastic gradient descent (SGD), and identify the causes of scaling inefficiency at higher node counts. To our knowledge, this is the first exploration of distributed GRPC TensorFlow scalability with synchronous SGD on an HPC supercomputer at such large scale. We study the scaling of two convolutional neural networks: ResNet-50, a state-of-the-art deep network for classification with roughly 25.5 million parameters, and HEP-CNN, a shallow topology with fewer than 1 million parameters for common scientific use cases. For ResNet-50, we achieve >80% scaling efficiency on up to 128 workers using 32 parameter servers (PS tasks), with a steep decline down to 23% for 512 workers using 64 PS tasks. Our analysis attributes the efficiency drop to low network bandwidth utilization caused by the combined effect of three factors: (a) the heterogeneous distributed parallelization algorithm, which uses PS tasks as centralized servers for gradient averaging, is suboptimal for utilizing interconnect bandwidth; (b) load imbalance among PS tasks hinders their efficient scaling; and (c) the underlying GRPC communication primitive is currently inefficient on the Cori high-speed interconnect. HEP-CNN demands less interconnect bandwidth and shows >80% weak scaling efficiency for up to 256 nodes with only one PS task. Our findings apply to other deep learning networks: large networks with millions of parameters run into the issues discussed here, while shallower networks such as HEP-CNN, with relatively few parameters, can achieve efficient weak scaling even with a single parameter server.
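For readers unfamiliar with the parameter-server setup the paper benchmarks, the sketch below shows a minimal synchronous-SGD configuration using the TensorFlow 1.x distributed API (tf.train.ClusterSpec, tf.train.replica_device_setter, tf.train.SyncReplicasOptimizer) over the default gRPC protocol. The host names, task counts, and toy model are illustrative placeholders, not the authors' actual Cori configuration, which scales to 64 PS tasks and 512 workers.

    import tensorflow as tf

    # Hypothetical cluster: one PS task and two worker tasks (placeholder hostnames).
    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0.example:2222"],
        "worker": ["worker0.example:2222", "worker1.example:2222"],
    })

    # Each process starts a gRPC server for its own task;
    # job_name and task_index differ per process.
    server = tf.train.Server(cluster, job_name="worker", task_index=0,
                             protocol="grpc")

    # Variables are placed on the PS tasks, compute ops on the local worker.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        y = tf.placeholder(tf.float32, [None, 10])
        w = tf.Variable(tf.zeros([784, 10]))
        b = tf.Variable(tf.zeros([10]))
        logits = tf.matmul(x, w) + b
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
        global_step = tf.train.get_or_create_global_step()

        # Synchronous SGD: gradients from all workers are aggregated on the
        # PS tasks before the variable update is applied.
        opt = tf.train.SyncReplicasOptimizer(
            tf.train.GradientDescentOptimizer(0.01),
            replicas_to_aggregate=2,
            total_num_replicas=2)
        train_op = opt.minimize(loss, global_step=global_step)

    # The chief worker coordinates synchronization via a session hook.
    hooks = [opt.make_session_run_hook(is_chief=True)]
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=True,
                                           hooks=hooks) as sess:
        pass  # training loop: sess.run(train_op, feed_dict=...)

In this scheme every worker pushes its gradients to, and pulls updated variables from, the centralized PS tasks over gRPC; the paper's analysis shows that this centralization, PS load imbalance, and gRPC overheads on the Aries interconnect limit bandwidth utilization at large node counts.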
