An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures

Ammar Ahmad Awan, Hari Subramoni, Dhabaleswar K. Panda
Dept. of Computer Science and Engg., The Ohio State University
The Machine Learning Conference, 2017

@article{awan2017depth,
   title={An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures},
   author={Awan, Ammar Ahmad and Subramoni, Hari and Panda, Dhabaleswar K.},
   year={2017}
}

Traditionally, Deep Learning (DL) frameworks like Caffe, TensorFlow, and Cognitive Toolkit have exploited GPUs to accelerate the training process. This has been achieved primarily through aggressive improvements in parallel hardware as well as through sophisticated software libraries like cuDNN and cuBLAS. However, recent enhancements to CPU-based hardware and software have the potential to significantly improve the performance of CPU-based DL training. In this paper, we provide a complete performance landscape of CPU- and GPU-based DNN training. We characterize the performance of DNN training for AlexNet and ResNet-50 across a wide range of CPU and GPU architectures, including the latest Intel Xeon Phi (Knights Landing) processors and NVIDIA Pascal GPUs. We also present multi-node DNN training performance results for AlexNet and ResNet-50 using the Intel Machine Learning Scaling Library (MLSL) and Intel-Caffe. In addition, we provide a CPU vs. GPU comparison for multi-node training using OSU-Caffe and Intel-Caffe. To the best of our knowledge, this is the first study that characterizes DNN training performance in a holistic manner while also providing an in-depth, layer-wise look at different DNNs. We provide multiple key insights: 1) convolutions account for the majority of the time (up to 83%) consumed in DNN training, 2) GPU-based training continues to deliver excellent performance (up to 18% better than KNL) across generations of GPU hardware and software, and 3) recent CPU-based optimizations like MKL-DNN and OpenMP-based thread parallelism lead to excellent speed-ups over under-optimized designs (up to 3.2X improvement for AlexNet training).
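
For readers who want to reproduce the layer-wise view described above, Caffe ships a built-in benchmarking tool ("caffe time") that reports average per-layer forward and backward times. The following is a minimal sketch that drives it from Python with OpenMP/MKL thread settings of the kind typically used for CPU runs; the binary path, model prototxt, and the 68-thread Knights Landing configuration are assumptions for illustration, not values taken from the paper.

   import os
   import subprocess

   # Hypothetical paths -- point these at a local Intel-Caffe build and an
   # AlexNet prototxt; neither value comes from the paper itself.
   CAFFE_BIN = "/opt/intel-caffe/build/tools/caffe"
   MODEL = "models/bvlc_alexnet/train_val.prototxt"

   # OpenMP/MKL settings of the kind used for CPU (e.g. KNL) runs; the
   # thread count of 68 assumes one thread per core on a 68-core KNL part.
   env = os.environ.copy()
   env["OMP_NUM_THREADS"] = "68"
   env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

   # 'caffe time' runs forward and backward passes for the given number of
   # iterations and prints average per-layer times, giving the layer-wise
   # breakdown (e.g. the share of time spent in convolutions) discussed above.
   subprocess.run(
       [CAFFE_BIN, "time", "-model", MODEL, "-iterations", "50"],
       env=env,
       check=True,
   )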