TY - GEN
T1 - Prediction of the Resource Consumption of Distributed Deep Learning Systems
AU - Yang, Gyeongsik
AU - Shin, Changyong
AU - Lee, Jeunghwan
AU - Yoo, Yeonho
AU - Yoo, Chuck
N1 - Funding Information:
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation grant funded by the Korea government (Ministry of Science and ICT) (2015-0-00280) and by the Basic Science Research Program through the NRF funded by the Ministry of Education (NRF-2021R1A6A1A13044830).
Publisher Copyright:
© 2022 Owner/Author.
PY - 2022/6/6
Y1 - 2022/6/6
N2 - Predicting resource consumption for the distributed training of deep learning models is of paramount importance, as it can inform users a priori of how long their training will take and enable them to manage the cost of training. Yet, no such prediction is available to users because the resource consumption itself varies significantly with "settings" such as GPU types and with "workloads" such as deep learning models. Previous studies have attempted to derive or model such a prediction, but they fall short of accommodating the various combinations of settings and workloads together. This study presents Driple, which designs graph neural networks to predict the resource consumption of diverse workloads. Driple also designs transfer learning to extend the graph neural networks to adapt to differences in settings. The evaluation results show that Driple effectively predicts a wide range of workloads and settings. In addition, Driple can reduce the time required to tailor the prediction for different settings by up to 7.3×.
KW - distributed deep learning
KW - graph neural networks
KW - resource prediction
KW - training time prediction
KW - transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85132196416&partnerID=8YFLogxK
U2 - 10.1145/3489048.3530962
DO - 10.1145/3489048.3530962
M3 - Conference contribution
AN - SCOPUS:85132196416
T3 - SIGMETRICS/PERFORMANCE 2022 - Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems
SP - 69
EP - 70
BT - SIGMETRICS/PERFORMANCE 2022 - Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems
PB - Association for Computing Machinery, Inc
T2 - 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS/PERFORMANCE 2022
Y2 - 6 June 2022 through 10 June 2022
ER -