Machine Learning at the Limit


John Canny, Huasha Zhao, Ye Chen, Bobby Jaros, Jiangchang Mao

UC Berkeley, Berkeley, CA 94720, USA

IEEE Big Data, 2015

@inproceedings{canny2015machine,
  title={Machine learning at the limit},
  author={Canny, John and Zhao, Huasha and Jaros, Bobby and Chen, Ye and Mao, Jiangchang},
  booktitle={Big Data (Big Data), 2015 IEEE International Conference on},
  pages={233--242},
  year={2015},
  organization={IEEE}
}

Package: BIDMach: CPU and GPU-accelerated Machine Learning Library


Many systems have been developed for machine learning at scale. Performance has steadily improved, but there has been relatively little work on explicitly defining or approaching the limits of performance. In this paper we describe the application of roofline design, an approach borrowed from computer architecture, to large-scale machine learning. In roofline design, one exposes ALU, memory, and network limits, and the constraints they imply for algorithms. Using roofline design, we have developed a system called BIDMach which has demonstrated the highest performance to date for many ML problems. On one GPU-accelerated node, it generally outperforms other single-machine toolkits and cluster toolkits running on hundreds of nodes. This performance level is enabled by a relatively small number of rooflined matrix primitives. Such performance implies a dramatic reduction in the energy used to perform these calculations. Beyond matrix kernels, roofline design can be applied to the end-to-end design of machine learning algorithms that minimize memory usage to optimize speed. This approach offers a further 2x to 3x gain in performance. Roofline design can also be applied to network primitives. We describe recent work on a sparse allreduce primitive called Kylix. We have shown that Kylix approaches the practical network throughput limit for allreduce, a basic primitive for distributed machine learning. Using Kylix, we describe an efficient transformation from model-parallel to data-parallel calculations. This transformation uses a secondary storage roofline, with similar parameters to the network. Finally, we describe several deployments of these techniques on real-world problems in two large internet companies. Once again, single-node rooflined design demonstrated substantial gains over alternatives on either single nodes or clusters.
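The roofline idea mentioned in the abstract can be illustrated with a small calculation. The following is a minimal sketch of the standard roofline bound from computer architecture, not code from BIDMach: attainable throughput is the lesser of the compute ceiling and the memory ceiling (bandwidth times arithmetic intensity). The function names and the hardware figures are hypothetical placeholders chosen for round numbers, not measurements from the paper.

```python
def attainable_gflops(peak_gflops: float,
                      bandwidth_gbs: float,
                      intensity_flops_per_byte: float) -> float:
    """Standard roofline bound.

    A kernel cannot run faster than the ALU peak, and it cannot run faster
    than the memory system can feed it: bandwidth (GB/s) multiplied by
    arithmetic intensity (flops performed per byte moved).
    """
    if intensity_flops_per_byte < 0:
        raise ValueError("arithmetic intensity must be non-negative")
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical device, for illustration only (not figures from the paper).
PEAK_GFLOPS = 1000.0    # assumed ALU ceiling
BANDWIDTH_GBS = 200.0   # assumed memory bandwidth ceiling

if __name__ == "__main__":
    # A low-intensity kernel (typical of sparse operations) is memory-bound.
    print(attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 0.25))
    # A high-intensity kernel (typical of dense matrix multiply) is compute-bound.
    print(attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 10.0))
    print(ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS))
```

Under these assumed numbers, a kernel below the ridge point gains nothing from faster ALUs, which is why reducing memory traffic can matter more than reducing arithmetic for such kernels.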

Tags: Computer science, CUDA, Machine learning, nVidia, nVidia GeForce GTX 680, nVidia GeForce GTX Titan X, Package

March 12, 2016 by hgpu
