
Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

Jie You, Jae-Won Chung, Mosharaf Chowdhury
University of Michigan
arXiv:2208.06102 [cs.LG], (12 Aug 2022)

@misc{https://doi.org/10.48550/arxiv.2208.06102,
   doi={10.48550/ARXIV.2208.06102},
   url={https://arxiv.org/abs/2208.06102},
   author={You, Jie and Chung, Jae-Won and Chowdhury, Mosharaf},
   keywords={Machine Learning (cs.LG), Artificial Intelligence (cs.AI), Distributed, Parallel, and Cluster Computing (cs.DC), FOS: Computer and information sciences},
   title={Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training},
   publisher={arXiv},
   year={2022},
   copyright={arXiv.org perpetual, non-exclusive license}
}

Training deep neural networks (DNNs) is becoming more and more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency. In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose an optimization framework, Zeus, to navigate this tradeoff by automatically finding optimal job- and GPU-level configurations for recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 15.3%–75.8% for diverse workloads.
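
To make the exploration-exploitation idea in the abstract concrete, here is a minimal sketch of tuning a recurring job's configuration online. It is not the algorithm from the paper: it swaps in a plain epsilon-greedy bandit, and every name in it (`weighted_cost`, `EpsilonGreedyTuner`, the `weight` and `ref_power_w` parameters, and the choice of (batch size, GPU power limit) pairs as configurations) is an assumption made for illustration. Energy and time measurements are expected to come from the caller.

```python
import random


def weighted_cost(energy_j, time_s, weight=0.5, ref_power_w=300.0):
    """Blend energy (joules) and time (seconds) into one cost value.

    Hypothetical cost function: `weight` sets how much energy matters
    relative to time, and `ref_power_w` converts seconds into joules so
    that both terms share a unit.
    """
    return weight * energy_j + (1.0 - weight) * ref_power_w * time_s


class EpsilonGreedyTuner:
    """Epsilon-greedy choice among a fixed set of job configurations."""

    def __init__(self, configs, epsilon=0.1, seed=0):
        self.configs = list(configs)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {c: 0 for c in self.configs}
        self.mean_cost = {c: 0.0 for c in self.configs}

    def choose(self):
        # Try every configuration once before exploiting any of them.
        untried = [c for c in self.configs if self.counts[c] == 0]
        if untried:
            return untried[0]
        # Explore a random configuration with probability epsilon.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.configs)
        # Otherwise exploit the lowest observed mean cost.
        return min(self.configs, key=lambda c: self.mean_cost[c])

    def update(self, config, cost):
        # Incremental running mean of the cost observed for this config.
        self.counts[config] += 1
        n = self.counts[config]
        self.mean_cost[config] += (cost - self.mean_cost[config]) / n

    def best(self):
        # Best configuration among those measured at least once.
        tried = [c for c in self.configs if self.counts[c] > 0]
        return min(tried, key=lambda c: self.mean_cost[c])
```

On each recurrence of the job, the caller would call `choose()`, run training with that configuration while measuring energy and time, and pass `weighted_cost(...)` of the measurements to `update()`. Because the running means keep updating, the preferred configuration can shift if costs change over time, which loosely mirrors the adaptation to data drift that the abstract mentions.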

