TensorLightning: A Traffic-Efficient Distributed Deep Learning on Commodity Spark Clusters

Seil Lee, Hanjoo Kim, Jaehong Park, Jaehee Jang, Chang-Sung Jeong, Sungroh Yoon

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

With the recent success of deep learning, the amount of data and computation continues to grow daily. Hence, distributed deep learning systems that share the training workload have been researched extensively. Although scale-out distributed environments built on commodity servers are widely used, they are limited by synchronous operation and communication traffic, and combining deep neural network (DNN) training with existing clusters often demands additional hardware and migration between different cluster frameworks or libraries, which is highly inefficient. Therefore, we propose TensorLightning, which integrates the widely used data pipeline of Apache Spark with the powerful deep learning libraries Caffe and TensorFlow. TensorLightning introduces a new parameter aggregation algorithm and parallel asynchronous parameter management schemes to relieve communication discrepancies and overhead. We redesign the elastic averaging stochastic gradient descent algorithm to work with pruned, sparse-form parameters. Our approach provides fast and flexible DNN training with high accessibility. We evaluated the proposed framework with convolutional and recurrent neural network models; it reduces network traffic by 67% while converging faster.
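
The traffic savings hinge on the redesigned elastic averaging SGD (EASGD) with pruned, sparse-form parameters. The record above does not reproduce the update rule, so what follows is only a minimal illustrative sketch of a generic EASGD-style step in which the elastic difference is top-k sparsified before being exchanged; the function names, the top-k pruning rule, and all hyperparameter values are assumptions made for illustration, not the algorithm published in the paper.

# Illustrative sketch (not the authors' implementation): an EASGD-style
# update in which each worker exchanges only a pruned, sparse form of the
# elastic difference with the central variable.
import numpy as np

def top_k_sparse(delta, k):
    # Keep only the k largest-magnitude entries of delta; zero out the rest.
    if k >= delta.size:
        return delta.copy()
    idx = np.argpartition(np.abs(delta), -k)[-k:]
    sparse = np.zeros_like(delta)
    sparse[idx] = delta[idx]
    return sparse

def easgd_round(worker, center, grad, lr=0.01, rho=0.1, k=1000):
    # One EASGD-style round for a single worker (all values are assumed).
    # worker, center, grad: 1-D NumPy arrays of the same length.
    # lr: learning rate; rho: elastic coupling strength;
    # k: number of entries kept when the elastic difference is pruned.
    diff = worker - center                # elastic difference
    sparse_diff = top_k_sparse(diff, k)   # pruned, sparse form that would be exchanged
    # Worker step: local gradient plus an elastic pull toward the center.
    worker = worker - lr * (grad + rho * sparse_diff)
    # Center step: pulled toward the worker by the same sparse difference.
    center = center + lr * rho * sparse_diff
    return worker, center

# Toy usage with random data.
rng = np.random.default_rng(0)
w = rng.normal(size=10_000)
c = np.zeros(10_000)
g = rng.normal(size=10_000)
w, c = easgd_round(w, c, g, k=500)

In this sketch, only the k surviving entries of the pruned difference (as index-value pairs) would need to cross the network instead of the full dense parameter vector, which is the intuition behind the reported traffic reduction; the paper's actual aggregation and asynchronous parameter-managing schemes may differ.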

Original language: English
Pages (from-to): 27671-27680
Number of pages: 10
Journal: IEEE Access
Volume: 6
DOIs: 10.1109/ACCESS.2018.2842103
Publication status: Published - 2018 May 29


Keywords

  • Apache Spark
  • commodity servers
  • deep learning
  • distributed system
  • TensorLightning

ASJC Scopus subject areas

  • Computer Science(all)
  • Materials Science(all)
  • Engineering(all)

Cite this

TensorLightning: A Traffic-Efficient Distributed Deep Learning on Commodity Spark Clusters. / Lee, Seil; Kim, Hanjoo; Park, Jaehong; Jang, Jaehee; Jeong, Chang-Sung; Yoon, Sungroh.

In: IEEE Access, Vol. 6, 29.05.2018, p. 27671-27680.

Research output: Contribution to journal › Article

@article{577cecf8104d430bbc887625ca7f896d,
title = "TensorLightning: A Traffic-Efficient Distributed Deep Learning on Commodity Spark Clusters",
abstract = "With the recent success of deep learning, the amount of data and computation continues to grow daily. Hence a distributed deep learning system that shares the training workload has been researched extensively. Although a scale-out distributed environment using commodity servers is widely used, not only is there a limit due to synchronous operation and communication traffic but also combining deep neural network (DNN) training with existing clusters often demands additional hardware and migration between different cluster frameworks or libraries, which is highly inefficient. Therefore, we propose TensorLightning which integrates the widely used data pipeline of Apache Spark with powerful deep learning libraries, Caffe and TensorFlow. TensorLightning embraces a brand-new parameter aggregation algorithm and parallel asynchronous parameter managing schemes to relieve communication discrepancies and overhead. We redesign the elastic averaging stochastic gradient descent algorithm with pruned and sparse form parameters. Our approach provides the fast and flexible DNN training with high accessibility. We evaluated our proposed framework with convolutional neural network and recurrent neural network models; the framework reduces network traffic by 67{\%} with faster convergence.",
keywords = "Apache Spark, commodity servers, deep learning, distributed system, TensorLightning",
author = "Seil Lee and Hanjoo Kim and Jaehong Park and Jaehee Jang and Chang-Sung Jeong and Sungroh Yoon",
year = "2018",
month = "5",
day = "29",
doi = "10.1109/ACCESS.2018.2842103",
language = "English",
volume = "6",
pages = "27671--27680",
journal = "IEEE Access",
issn = "2169-3536",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}
