Tajo: A distributed data warehouse system on large clusters

Hyunsik Choi, Jihoon Son, Haemi Yang, Hyoseok Ryu, Byungnam Lim, Soohyung Kim, Yon Dohn Chung

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

16 Citations (Scopus)

Abstract

The increasing volume of relational data calls for new ways to manage it. Recently, several hybrid approaches that combine parallel databases with Hadoop (e.g., HadoopDB and Hive) have been introduced to the database community. Although these hybrid approaches have gained wide popularity, they cannot avoid choosing suboptimal execution strategies. We believe that this problem is caused by the inherent limits of their architectures. In this demo, we present Tajo, a relational, distributed data warehouse system on shared-nothing clusters. It uses the Hadoop Distributed File System (HDFS) as its storage layer and has its own query execution engine, which we developed to replace the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes. The master is mainly responsible for query planning and for coordinating the workers. The master divides a query into small tasks and disseminates them to workers. Each worker has a local query engine that executes a directed acyclic graph (DAG) of physical operators. A DAG of operators can take two or more input sources and be pipelined within the local query engine. In addition, Tajo can control distributed data flow more flexibly than MapReduce and supports indexing techniques. By combining these features, Tajo can employ more optimized and efficient query processing, including existing methods that have been studied in traditional database research. To give a deep understanding of the Tajo architecture and its behavior during query processing, the demonstration will allow users to submit TPC-H queries to a Tajo cluster of 32 nodes. The web-based user interface will show (1) how the submitted queries are planned, (2) how the queries are distributed across nodes, (3) the cluster and node status, and (4) the details of relations and their physical information. We also provide a performance evaluation of Tajo compared with Hive.
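The master/worker split and the operator DAG described in the abstract can be pictured with a small sketch. This is not Tajo code and does not use any Tajo API: every name below (`scan`, `select`, `hash_join`, `run_task`, `assign_tasks`, the worker labels, and the sample rows) is hypothetical. It assumes a round-robin task assignment and a small `customers` relation that is handed to every task, neither of which is stated in the abstract. Python generators stand in for pipelined physical operators, and the join shows an operator DAG that takes two input sources.

```python
from collections import defaultdict
from itertools import cycle


def scan(rows):
    """Leaf operator: stream rows from one input source."""
    for row in rows:
        yield row


def select(child, predicate):
    """Filter operator: pass through rows that satisfy the predicate."""
    return (row for row in child if predicate(row))


def hash_join(left, right, key):
    """Binary operator: build a hash table on the left input, then stream the right input."""
    table = defaultdict(list)
    for row in left:
        table[row[key]].append(row)
    for row in right:
        for match in table.get(row[key], []):
            yield {**match, **row}


def run_task(customers, order_fragment):
    """Worker-side plan: two scans feed a join, with a filter pipelined on one side."""
    big_orders = select(scan(order_fragment), lambda r: r["total"] > 100)
    return list(hash_join(scan(customers), big_orders, "cust_id"))


def assign_tasks(fragments, workers):
    """Master-side step (assumed policy): one task per fragment, handed out round-robin."""
    plan = defaultdict(list)
    for worker, fragment in zip(cycle(workers), fragments):
        plan[worker].append(fragment)
    return dict(plan)


# Hypothetical sample data: one small relation and one relation split into two fragments.
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
order_fragments = [
    [{"cust_id": 1, "total": 250}, {"cust_id": 2, "total": 40}],
    [{"cust_id": 2, "total": 300}],
]

plan = assign_tasks(order_fragments, ["worker-0", "worker-1"])
results = [
    row
    for fragments in plan.values()
    for fragment in fragments
    for row in run_task(customers, fragment)
]
```

Because each operator is a generator, rows flow from the scans through the filter into the join one at a time, without materializing intermediate results between operators; only the build side of the join is held in memory.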

Original language: English
Title of host publication: Proceedings - International Conference on Data Engineering
Pages: 1320-1323
Number of pages: 4
DOI: https://doi.org/10.1109/ICDE.2013.6544934
Publication status: Published - 2013 Aug 15
Event: 29th International Conference on Data Engineering, ICDE 2013 - Brisbane, QLD, Australia
Duration: 2013 Apr 8 to 2013 Apr 11

Other

Other: 29th International Conference on Data Engineering, ICDE 2013
Country: Australia
City: Brisbane, QLD
Period: 13/4/8 to 13/4/11

Fingerprint

Data warehouses
Query processing
Engines
User interfaces
Demonstrations
Planning

ASJC Scopus subject areas

  • Information Systems
  • Signal Processing
  • Software

Cite this

Choi, H., Son, J., Yang, H., Ryu, H., Lim, B., Kim, S., & Chung, Y. D. (2013). Tajo: A distributed data warehouse system on large clusters. In Proceedings - International Conference on Data Engineering (pp. 1320-1323). [6544934] https://doi.org/10.1109/ICDE.2013.6544934

@inproceedings{2179ebdb11234d6581c81670cb436c8e,
title = "Tajo: A distributed data warehouse system on large clusters",
author = "Hyunsik Choi and Jihoon Son and Haemi Yang and Hyoseok Ryu and Byungnam Lim and Soohyung Kim and Chung, {Yon Dohn}",
year = "2013",
month = "8",
day = "15",
doi = "10.1109/ICDE.2013.6544934",
language = "English",
isbn = "9781467349086",
pages = "1320--1323",
booktitle = "Proceedings - International Conference on Data Engineering",

}

TY - GEN

T1 - Tajo

T2 - A distributed data warehouse system on large clusters

AU - Choi, Hyunsik

AU - Son, Jihoon

AU - Yang, Haemi

AU - Ryu, Hyoseok

AU - Lim, Byungnam

AU - Kim, Soohyung

AU - Chung, Yon Dohn

PY - 2013/8/15

Y1 - 2013/8/15

UR - http://www.scopus.com/inward/record.url?scp=84881332659&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84881332659&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2013.6544934

DO - 10.1109/ICDE.2013.6544934

M3 - Conference contribution

AN - SCOPUS:84881332659

SN - 9781467349086

SP - 1320

EP - 1323

BT - Proceedings - International Conference on Data Engineering

ER -