Building a Performance Model for Deep Learning Recommendation Model Training on GPUs
Facebook, Inc.
arXiv:2201.07821 [cs.LG], 19 Jan 2022
@misc{lin2022building,
  title={Building a Performance Model for Deep Learning Recommendation Model Training on GPUs},
  author={Zhongyi Lin and Louis Feng and Ehsan K. Ardestani and Jaewon Lee and John Lundell and Changkyu Kim and Arun Kejariwal and John D. Owens},
  year={2022},
  eprint={2201.07821},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for the operators that dominate the device active time, and (2) categorizing operator overheads into five types to quantify their contribution to the device idle time. Combining these two parts, we propose a critical-path-based algorithm that predicts the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) across all kernel performance models, and 5.23% and 7.96% geomean errors for GPU active time and overall end-to-end per-batch training time prediction, respectively. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors, but also yields comparable accuracy on the compute-bound ML models targeted by most previous methods. Finally, using this performance model together with graph-level data and task-dependency analyses, we show that our system enables more general model-system co-design than previous methods.
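To make the critical-path idea concrete, here is a minimal Python sketch of how a per-batch time prediction of this flavor could be assembled from per-kernel time and per-op overhead estimates. It is an illustration under stated assumptions, not the paper's implementation: the names `Op`, `predict_batch_time_us`, and `gmae` are hypothetical, a linear op list stands in for the real execution graph, the single `overhead_us` field collapses the paper's five overhead types into one, and the GMAE definition below (geometric mean of per-sample relative errors) is likewise an assumption.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Op:
    # Hypothetical stand-in for one node of the training execution graph.
    name: str
    kernel_time_us: float  # output of a heuristic- or ML-based kernel model
    overhead_us: float     # host-side overhead (five types in the paper, one here)

def predict_batch_time_us(ops: List[Op]) -> float:
    """Walk ops in execution order, tracking a CPU launch timeline and a
    GPU stream timeline; the later of the two at each step forms the
    critical path. Returns the predicted per-batch device time."""
    cpu_t = 0.0     # when the host finishes launching the current op
    gpu_free = 0.0  # when the GPU stream becomes free
    for op in ops:
        cpu_t += op.overhead_us        # overhead delays the launch
        start = max(cpu_t, gpu_free)   # kernel waits on both launch and GPU
        gpu_free = start + op.kernel_time_us
    return gpu_free

def gmae(pred: Sequence[float], actual: Sequence[float]) -> float:
    """Geometric mean of per-sample relative errors (assumed GMAE definition)."""
    errs = [abs(p - a) / a for p, a in zip(pred, actual)]
    return math.exp(sum(math.log(e) for e in errs) / len(errs))

if __name__ == "__main__":
    # Toy batch: an embedding lookup, an MLP GEMM, and a feature interaction.
    ops = [Op("embedding", 120.0, 40.0), Op("mlp_gemm", 300.0, 15.0),
           Op("interaction", 80.0, 25.0)]
    print(f"predicted batch time: {predict_batch_time_us(ops):.1f} us")
    print(f"GMAE: {gmae([580.0], [600.0]):.2%}")
```

Note how a slow host launch (large `overhead_us`) pushes `max(cpu_t, gpu_free)` past the point where the GPU went idle, which is exactly how operator overheads surface as device idle time in the overall prediction.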
January 23, 2022 by hgpu