high performance computing on graphics processing units: hgpu.org

Improving the speed of neural networks on CPUs

Vincent Vanhoucke, Andrew Senior, Mark Z. Mao

Google, Inc., Mountain View, CA 94043

Deep Learning and Unsupervised Feature Learning Workshop, NIPS, 2011

@article{vanhoucke2011improving,
  title={Improving the speed of neural networks on CPUs},
  author={Vanhoucke, V. and Senior, A. and Mao, M.Z.},
  year={2011}
}

Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and, in particular, SSSE3 and SSE4 fixed-point instructions, which provide a 3x improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10x speedup over an unoptimized baseline and a 4x speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.
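The fixed-point idea mentioned in the abstract can be illustrated with a minimal, purely scalar sketch. It is not the authors' implementation and uses no SSE intrinsics: it only shows how a floating-point dot product can be approximated by 8-bit integer multiplies with a wide integer accumulator, which is the kind of arithmetic that fixed-point SIMD instructions perform many lanes at a time. The function names, the choice of signed 8-bit weights and unsigned 8-bit activations, and the scaling scheme are all assumptions made for illustration.

```python
def quantize_weights(weights):
    """Map float weights to signed 8-bit integers (assumed scheme).

    Returns the quantized values and the scale used, so the result
    can be converted back to floating point later.
    """
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    scale = 127.0 / max_abs
    quantized = [max(-128, min(127, round(w * scale))) for w in weights]
    return quantized, scale


def quantize_activations(activations):
    """Map activations assumed to lie in [0, 1] to unsigned 8-bit integers."""
    scale = 255.0
    quantized = [max(0, min(255, round(a * scale))) for a in activations]
    return quantized, scale


def fixed_point_dot(q_weights, q_activations):
    """Integer multiply-accumulate.

    Python integers are unbounded; a real implementation would use a
    32-bit accumulator and would have to guard against overflow.
    """
    return sum(w * a for w, a in zip(q_weights, q_activations))


def dequantize(accumulator, weight_scale, activation_scale):
    """Convert the integer accumulator back to a floating-point value."""
    return accumulator / (weight_scale * activation_scale)


if __name__ == "__main__":
    weights = [0.5, -0.25, 1.0]
    activations = [1.0, 0.0, 0.5]
    q_w, w_scale = quantize_weights(weights)
    q_a, a_scale = quantize_activations(activations)
    approx = dequantize(fixed_point_dot(q_w, q_a), w_scale, a_scale)
    exact = sum(w * a for w, a in zip(weights, activations))
    print("exact:", exact, "fixed-point approximation:", approx)
```

The approximation differs from the exact result only by quantization error, which is small relative to the value when the inputs use most of the 8-bit range. Whether that error is acceptable for a given network has to be checked empirically on the task at hand.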

Tags: Computer science, CUDA, Neural networks, nVidia, Speech recognition, Tesla C2070, Tutorial

December 28, 2011 by hgpu
