
Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs

Stefan Hadjis, Ce Zhang, Ioannis Mitliagkas, Christopher Re
Department of Computer Science, Stanford University
arXiv:1606.04487 [cs.DC], 14 Jun 2016

@article{hadjis2016omnivore,
   title={Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs},
   author={Hadjis, Stefan and Zhang, Ce and Mitliagkas, Ioannis and Re, Christopher},
   year={2016},
   month={jun},
   eprint={1606.04487},
   archivePrefix={arXiv},
   primaryClass={cs.DC}
}

Download (PDF) | View | Source

803 views

We perform a study of the factors affecting training time in multi-device deep learning systems. Given a specification of a convolutional neural network, we study how to minimize the time to train this model on a cluster of commodity CPUs and GPUs. Our first contribution focuses on the single-node setting, in which we show that by using standard batching and data-parallel techniques, throughput can be improved by at least 5.5x over state-of-the-art systems when training on CPUs. This ensures an end-to-end training speed directly proportional to the throughput of a device regardless of its underlying hardware, allowing each node in the cluster to be treated as a black box. Our second contribution is a theoretical and empirical study of the tradeoffs affecting end-to-end training time in a multiple-device setting. We identify the degree of asynchronous parallelization as a key factor affecting both hardware and statistical efficiency. We show that asynchrony can be viewed as introducing a momentum parameter, which we use to limit our search space; in turn, this leads to a simpler optimizer, which is our third contribution. Our optimizer involves a predictive model for the total time to convergence and selects an allocation of resources to minimize that time. We demonstrate that the most popular distributed deep learning systems fall within our tradeoff space but do not optimize within it. By doing such optimization, our prototype runs 1.9x to 12x faster than the fastest state-of-the-art systems.
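
As a rough illustration of the tradeoff the abstract describes, the Python sketch below models total training time as (epochs to converge) divided by (aggregate throughput): increasing the degree of asynchrony improves hardware efficiency but, viewed as added momentum, can hurt statistical efficiency. This is a minimal sketch, not the authors' implementation; the 1 - 1/M momentum rule of thumb, the penalty constant, and all function names are illustrative assumptions rather than values or code from the paper.

```python
# Minimal sketch (illustrative only) of the optimizer idea from the abstract:
# for each candidate degree of asynchrony, estimate hardware efficiency
# (throughput) and statistical efficiency (epochs to converge), then pick the
# setting that minimizes predicted time to convergence.

def implicit_momentum(num_async_workers):
    """Asynchrony across M workers acts roughly like a momentum term of 1 - 1/M
    (the 'asynchrony as momentum' view mentioned in the abstract; assumed model)."""
    return 1.0 - 1.0 / num_async_workers

def predicted_training_time(num_async_workers, per_worker_throughput,
                            base_epochs, dataset_size, epoch_penalty=0.3):
    # Hardware efficiency: aggregate throughput grows with the number of workers.
    throughput = num_async_workers * per_worker_throughput  # images/sec
    # Statistical efficiency: more implicit momentum -> more epochs to converge.
    # (A stand-in model; the paper fits a predictive model instead.)
    epochs = base_epochs * (1.0 + epoch_penalty * implicit_momentum(num_async_workers))
    return epochs * dataset_size / throughput  # seconds

def choose_degree_of_asynchrony(max_workers, per_worker_throughput,
                                base_epochs, dataset_size):
    """Return the number of asynchronous workers minimizing predicted time."""
    return min(range(1, max_workers + 1),
               key=lambda m: predicted_training_time(m, per_worker_throughput,
                                                     base_epochs, dataset_size))

if __name__ == "__main__":
    best = choose_degree_of_asynchrony(max_workers=8, per_worker_throughput=500.0,
                                       base_epochs=20, dataset_size=1_000_000)
    print("predicted best degree of asynchrony:", best)
```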

* * *

Featured events

The Third International Workshop on GPU Computing and AI (GCA), 2018
November 27-30, 2018, Hida Takayama, Japan

The 5th International Conference on Power and Energy Systems Engineering (CPESE), 2018
September 19-21, 2018, Nagoya University, Japan

The 10th International Conference on Information Management and Engineering (ICIME), 2018
September 22-24, 2018, MediaCityUK, Salford Quays, Greater Manchester, England

The 4th International Conference on Control Science and Systems Engineering (ICCSSE), 2018
August 21-23, 2018, No. 1037, Luoyu Road, Hongshan District, Wuhan, China

The 2018 International Conference on Cloud Computing and Internet of Things (CCIOT’18), 2018
October 29-31, 2018, Nanyang Executive Centre in Nanyang Technological University, Singapore

HGPU group © 2010-2018 hgpu.org

All rights belong to the respective authors
