cuDNN: Efficient Primitives for Deep Learning

Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer
NVIDIA, Santa Clara, CA 95050
arXiv:1410.0759 [cs.NE] (3 Oct 2014)

@article{2014arXiv1410.0759C,
   author = {{Chetlur}, S. and {Woolley}, C. and {Vandermersch}, P. and {Cohen}, J. and {Tran}, J. and {Catanzaro}, B. and {Shelhamer}, E.},
   title = "{cuDNN: Efficient Primitives for Deep Learning}",
   journal = {ArXiv e-prints},
   archivePrefix = "arXiv",
   eprint = {1410.0759},
   keywords = {Computer Science - Neural and Evolutionary Computing, Computer Science - Learning, Computer Science - Mathematical Software},
   year = 2014,
   month = oct,
   adsurl = {http://adsabs.harvard.edu/abs/2014arXiv1410.0759C},
   adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

We present a library that provides optimized implementations for deep learning primitives. Deep learning workloads are computationally intensive, and optimizing the kernels of deep learning workloads is difficult and time-consuming. As parallel architectures evolve, kernels must be reoptimized for new processors, which makes maintaining codebases difficult over time. Similar issues have long been addressed in the HPC community by libraries such as the Basic Linear Algebra Subroutines (BLAS). However, there is no analogous library for deep learning. Without such a library, researchers implementing deep learning workloads on parallel processors must create and optimize their own implementations of the main computational kernels, and this work must be repeated as new parallel processors emerge. To address this problem, we have created a library similar in intent to BLAS, with optimized routines for deep learning workloads. Our implementation contains routines for GPUs, and similarly to the BLAS library, could be implemented for other platforms. The library is easy to integrate into existing frameworks, and provides optimized performance and memory usage. For example, integrating cuDNN into Caffe, a popular framework for convolutional networks, improves performance by 36% on a standard model while also reducing memory consumption.
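To make the "easy to integrate" claim concrete, the sketch below shows roughly how a framework sets up descriptors for a single convolutional layer with the cuDNN C API. It is illustrative only, not code from the paper: the exact setter signatures (e.g. `cudnnSetFilter4dDescriptor`, `cudnnSetConvolution2dDescriptor`) have changed across cuDNN versions, and the tensor/filter sizes are arbitrary placeholders.

```c
/* Minimal sketch of descriptor setup for one convolution with cuDNN.
 * Assumes a recent cuDNN release; signatures differ in older versions.
 * The actual forward call and device allocations are omitted for brevity. */
#include <cudnn.h>
#include <stdio.h>

#define CHECK(call) do {                                                 \
    cudnnStatus_t s_ = (call);                                           \
    if (s_ != CUDNN_STATUS_SUCCESS) {                                    \
        fprintf(stderr, "cuDNN error: %s\n", cudnnGetErrorString(s_));   \
        return 1;                                                        \
    }                                                                    \
} while (0)

int main(void) {
    cudnnHandle_t handle;
    CHECK(cudnnCreate(&handle));

    /* Input tensor: batch 1, 3 channels, 224x224, NCHW layout (placeholder sizes). */
    cudnnTensorDescriptor_t xDesc;
    CHECK(cudnnCreateTensorDescriptor(&xDesc));
    CHECK(cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                     1, 3, 224, 224));

    /* Filter bank: 64 filters of size 3 (input channels) x 3 x 3. */
    cudnnFilterDescriptor_t wDesc;
    CHECK(cudnnCreateFilterDescriptor(&wDesc));
    CHECK(cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                     64, 3, 3, 3));

    /* Convolution: padding 1, stride 1, dilation 1. */
    cudnnConvolutionDescriptor_t convDesc;
    CHECK(cudnnCreateConvolutionDescriptor(&convDesc));
    CHECK(cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                          CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));

    /* Ask cuDNN for the output shape implied by this configuration. */
    int n, c, h, w;
    CHECK(cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc,
                                                &n, &c, &h, &w));
    printf("output tensor: %d x %d x %d x %d\n", n, c, h, w);

    /* A framework would now allocate device buffers, pick an algorithm and
       workspace, and call cudnnConvolutionForward(); omitted here. */
    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}
```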