An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures

Ammar Ahmad Awan, Hari Subramoni, Dhabaleswar K. Panda
Dept. of Computer Science and Engg., The Ohio State University
The Machine Learning Conference, 2017


@inproceedings{awan2017indepth,
   title={An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures},
   author={Awan, Ammar Ahmad and Subramoni, Hari and Panda, Dhabaleswar K.},
   year={2017}
}





Traditionally, Deep Learning (DL) frameworks like Caffe, TensorFlow, and Cognitive Toolkit have exploited GPUs to accelerate the training process. This has been achieved primarily through aggressive improvements in parallel hardware as well as through sophisticated software libraries like cuDNN and cuBLAS. However, recent enhancements to CPU-based hardware and software have the potential to significantly improve the performance of CPU-based DL training. In this paper, we provide a complete performance landscape of CPU- and GPU-based DNN training. We characterize the performance of DNN training for AlexNet and ResNet-50 across a wide range of CPU and GPU architectures, including the latest Intel Xeon Phi (Knights Landing) processors and NVIDIA Pascal GPUs. We also present multi-node DNN training performance results for AlexNet and ResNet-50 using the Intel Machine Learning Scaling Library (MLSL) and Intel-Caffe. In addition, we provide a CPU vs. GPU comparison for multi-node training using OSU-Caffe and Intel-Caffe. To the best of our knowledge, this is the first study that examines the performance of DNN training in a holistic manner while also providing an in-depth look at layer-wise performance for different DNNs. We offer several key insights: 1) convolutions account for the majority of time consumed in DNN training (up to 83% of the time), 2) GPU-based training continues to deliver excellent performance (up to 18% better than KNL) across generations of GPU hardware and software, and 3) recent CPU-based optimizations like MKL-DNN and OpenMP-based thread parallelism lead to excellent speed-ups over under-optimized designs (up to 3.2X improvement for AlexNet training).
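The OpenMP-based thread parallelism mentioned above is typically controlled through environment variables before launching CPU-based training. The following is a minimal configuration sketch; the specific values (68 threads, compact affinity) are illustrative assumptions for a KNL-class processor and are not prescribed by the paper.

```shell
# Illustrative OpenMP/MKL thread settings for CPU-based DNN training
# (values are assumptions, e.g. one thread per core on a 68-core KNL 7250).
export OMP_NUM_THREADS=68
# Pin threads to cores to avoid migration overhead (Intel OpenMP runtime).
export KMP_AFFINITY=granularity=fine,compact,1,0
echo "$OMP_NUM_THREADS"
```

In practice, the best thread count and affinity depend on the core count, hyper-threading configuration, and the framework's internal threading model, so these settings are usually tuned empirically.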
