Prediction of the Resource Consumption of Distributed Deep Learning Systems

Gyeongsik Yang, Changyong Shin, Jeunghwan Lee, Yeonho Yoo, Chuck Yoo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Predicting resource consumption for the distributed training of deep learning models is of paramount importance, as it can inform users a priori of how long their training would take and enable them to manage the cost of training. Yet no such prediction is available to users, because resource consumption itself varies significantly according to "settings" such as GPU types and also by "workloads" such as deep learning models. Previous studies have attempted to derive or model such a prediction, but they fall short of accommodating the various combinations of settings and workloads together. This study presents Driple, which designs graph neural networks to predict the resource consumption of diverse workloads. Driple also designs transfer learning to extend the graph neural networks to adapt to differences in settings. The evaluation results show that Driple effectively predicts a wide range of workloads and settings. In addition, Driple can reduce the time required to tailor the prediction for different settings by up to 7.3×.

Original language: English
Title of host publication: SIGMETRICS/PERFORMANCE 2022 - Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems
Publisher: Association for Computing Machinery, Inc
Pages: 69-70
Number of pages: 2
ISBN (Electronic): 9781450391412
DOIs
Publication status: Published - 2022 Jun 6
Event: 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS/PERFORMANCE 2022 - Virtual, Online, India
Duration: 2022 Jun 6 to 2022 Jun 10

Publication series

Name: SIGMETRICS/PERFORMANCE 2022 - Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems

Conference

Conference: 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS/PERFORMANCE 2022
Country/Territory: India
City: Virtual, Online
Period: 22/6/6 to 22/6/10

Keywords

  • distributed deep learning
  • graph neural networks
  • resource prediction
  • training time prediction
  • transfer learning

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Networks and Communications
  • Hardware and Architecture
  • Software
