
Implementing Efficient, Portable Computations for Machine Learning

Matthew Walter Moskewicz
Electrical Engineering and Computer Sciences Department, University of California at Berkeley
University of California at Berkeley, Technical Report No. UCB/EECS-2017-37, 2017

@phdthesis{Moskewicz:EECS-2017-37,
   Author={Moskewicz, Matthew Walter},
   Title={Implementing Efficient, Portable Computations for Machine Learning},
   School={EECS Department, University of California, Berkeley},
   Year={2017},
   Month={May},
   URL={http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-37.html},
   Number={UCB/EECS-2017-37}
}

Computers are powerful tools that perform fast, accurate calculations over huge sets of data. However, many layers of abstraction are required to use computers for any given task. Recent advances in machine learning employ compute-intensive operations embedded in complex overall flows. Further, deployment of these systems must balance many concerns: accuracy, speed, energy, portability, and cost. Currently, a good implementation of the needed software layers for each target requires many programmer-years of effort. To address this, we explore new tools and methods to amplify programmer effort for machine learning applications. In particular, we focus on portability and speed for machine learning operations, algorithms, and flows, while maintaining accuracy and carefully controlling the complexity of the overall software system. First, we motivate our approach with a case study in developing libHOG, which provides high-speed primitives for calculating image gradient histograms; here we achieve a 3.6X speedup over the state of the art. Next, in DenseNet, we enable multiscale sliding-window object detection using dense convolutional neural network features, a task that was previously prohibitively slow. Finally, we propose our Boda framework for implementing artificial neural network computations, based on metaprogramming, specialization, and autotuning. In Boda, we explore in depth the development of efficient convolution operations across various types of hardware. With only a few months of effort, we achieve speed within 2X of the highly-tuned vendor library on NVIDIA Graphics Processing Units (GPUs). Further, in only a few weeks, we achieve up to 30% efficiency on Qualcomm mobile GPUs, for which no vendor library exists.
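To make concrete the kind of primitive libHOG accelerates, the following is a minimal NumPy sketch of a HOG-style, magnitude-weighted gradient orientation histogram computed over image cells. The function name and the cell_size and n_bins parameters are illustrative assumptions rather than libHOG's actual API, and this scalar version omits the low-level optimizations that account for libHOG's speedup.

import numpy as np

def gradient_histograms(img, cell_size=8, n_bins=9):
    # Illustrative sketch only; not the libHOG API.
    # Central-difference gradients of a 2D grayscale image.
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    # Unsigned orientation in [0, pi), quantized into n_bins bins.
    ang = np.arctan2(gy, gx) % np.pi
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    h_cells = img.shape[0] // cell_size
    w_cells = img.shape[1] // cell_size
    hist = np.zeros((h_cells, w_cells, n_bins))
    for cy in range(h_cells):
        for cx in range(w_cells):
            ys = slice(cy * cell_size, (cy + 1) * cell_size)
            xs = slice(cx * cell_size, (cx + 1) * cell_size)
            # Each pixel votes its gradient magnitude into its orientation bin.
            hist[cy, cx] = np.bincount(bins[ys, xs].ravel(),
                                       weights=mag[ys, xs].ravel(),
                                       minlength=n_bins)
    return hist

# Example: 9-bin histograms over 8x8 cells of a random 64x64 image.
hist = gradient_histograms(np.random.rand(64, 64))  # shape (8, 8, 9)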

