high performance computing on graphics processing units: hgpu.org

GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server

Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, Eric P. Xing

Carnegie Mellon University

ACM European Conference on Computer Systems (EuroSys’16), 2016

Large-scale deep learning requires huge computational resources to train a multi-layer neural network. Recent systems propose using hundreds to thousands of machines to train networks with tens of layers and billions of connections. While the computation involved can be done more efficiently on GPUs than on more traditional CPU cores, training such networks on a single GPU is too slow, and training on distributed GPUs can be inefficient due to data movement overheads, GPU stalls, and limited GPU memory. This paper describes a new parameter server, called GeePS, that supports scalable deep learning across GPUs distributed among multiple machines, overcoming these obstacles. We show that GeePS enables a state-of-the-art single-node GPU implementation to scale well, reaching 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code). Moreover, GeePS achieves a higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.
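
To put the scaling figures quoted above in perspective, the short sketch below works out the arithmetic they imply. It uses only the numbers stated in the abstract (13 times the throughput on 16 machines; 4 GPU machines versus 108 CPU-only machines). The function name `scaling_efficiency` is a hypothetical helper introduced here for illustration, and "efficiency" is taken to mean speedup divided by machine count, which is a common convention rather than a metric defined in the abstract.

```python
def scaling_efficiency(speedup: float, machines: int) -> float:
    """Return the fraction of ideal linear speedup that was achieved.

    Assumption: ideal scaling means N machines give N times the
    single-machine throughput, so efficiency = speedup / N.
    """
    if machines <= 0:
        raise ValueError("machines must be positive")
    return speedup / machines


# Figures from the abstract: 13x the images per second on 16 machines.
efficiency = scaling_efficiency(13.0, 16)
print(f"scaling efficiency on 16 machines: {efficiency:.2%}")

# Figures from the abstract: 4 GPU machines out-perform 108 CPU-only machines.
# This ratio only compares machine counts; it says nothing about the exact
# throughput gap, which the abstract does not quantify.
machine_ratio = 108 / 4
print(f"CPU-only machines per GPU machine: {machine_ratio:.0f}")
```

Under that convention, the reported result corresponds to roughly four fifths of ideal linear scaling, and the GPU configuration uses 27 times fewer machines than the CPU-only system it is compared against.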

Tags: Caffe, Computer science, CUDA, Deep learning, Neural networks, nVidia, Tesla K20

April 16, 2016 by hgpu