SSFile: A novel column-store for efficient data analysis in Hadoop-based distributed systems

Jihoon Son, Hyoseok Ryu, Sungmin Yi, Yon Dohn Chung

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

Recently, large-scale relational data analysis has gained much attention. Several Hadoop-based distributed systems have been proposed for scalable relational data analysis. Because the column-store approach is very suitable for analytic queries, many studies on column-oriented storage and query processing for Hadoop-based distributed systems have been conducted. However, two problems have arisen in existing studies, the first of which is that only a small amount of data is processed per task during distributed processing. Each task reads only the necessary data using the columnar structure. Because the task initialization in Hadoop requires a large overhead, it is inefficient that each task processes a small amount of data. The second problem is the lack of support for techniques that optimize columnar execution. Although many such techniques have been proposed for efficient columnar query execution, existing column-store methods for Hadoop-based distributed systems cannot support them efficiently. In this paper, we propose a novel column-store method called SSFile for Hadoop-based distributed systems. SSFile increases the actual amount of data processed per task and supports representative columnar execution techniques for efficient query processing. Through extensive experiments, we show that SSFile significantly improves the performance of distributed processing.

Original language: English
Pages (from-to): 68-86
Number of pages: 19
Journal: Information Sciences
Volume: 316
DOI: 10.1016/j.ins.2015.04.014
Publication status: Published - 2015 Sep 20

Keywords

  • Column-store
  • Distributed systems
  • Hadoop
  • HDFS
  • Relational data analysis
  • Server clusters

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Theoretical Computer Science
  • Computer Science Applications
  • Information Systems and Management

Cite this

SSFile: A novel column-store for efficient data analysis in Hadoop-based distributed systems. / Son, Jihoon; Ryu, Hyoseok; Yi, Sungmin; Chung, Yon Dohn.

In: Information Sciences, Vol. 316, 20.09.2015, p. 68-86.


@article{6053469f1e914fbc9e1fbd433ebf9b01,
title = "SSFile: A novel column-store for efficient data analysis in Hadoop-based distributed systems",
abstract = "Recently, large-scale relational data analysis has gained much attention. Several Hadoop-based distributed systems have been proposed for scalable relational data analysis. Because the column-store approach is very suitable for analytic queries, many studies on column-oriented storage and query processing for Hadoop-based distributed systems have been conducted. However, two problems have arisen in existing studies, the first of which is that only a small amount of data is processed per task during distributed processing. Each task reads only the necessary data using the columnar structure. Because the task initialization in Hadoop requires a large overhead, it is inefficient that each task processes a small amount of data. The second problem is the lack of support for techniques that optimize columnar execution. Although many such techniques have been proposed for efficient columnar query execution, existing column-store methods for Hadoop-based distributed systems cannot support them efficiently. In this paper, we propose a novel column-store method called SSFile for Hadoop-based distributed systems. SSFile increases the actual amount of data processed per task and supports representative columnar execution techniques for efficient query processing. Through extensive experiments, we show that SSFile significantly improves the performance of distributed processing.",
keywords = "Column-store, Distributed systems, Hadoop, HDFS, Relational data analysis, Server clusters",
author = "Son, Jihoon and Ryu, Hyoseok and Yi, Sungmin and Chung, Yon Dohn",
year = "2015",
month = "9",
day = "20",
doi = "10.1016/j.ins.2015.04.014",
language = "English",
volume = "316",
pages = "68--86",
journal = "Information Sciences",
issn = "0020-0255",
publisher = "Elsevier Inc.",
}

TY - JOUR

T1 - SSFile

T2 - A novel column-store for efficient data analysis in Hadoop-based distributed systems

AU - Son, Jihoon

AU - Ryu, Hyoseok

AU - Yi, Sungmin

AU - Chung, Yon Dohn

PY - 2015/9/20

Y1 - 2015/9/20

AB - Recently, large-scale relational data analysis has gained much attention. Several Hadoop-based distributed systems have been proposed for scalable relational data analysis. Because the column-store approach is very suitable for analytic queries, many studies on column-oriented storage and query processing for Hadoop-based distributed systems have been conducted. However, two problems have arisen in existing studies, the first of which is that only a small amount of data is processed per task during distributed processing. Each task reads only the necessary data using the columnar structure. Because the task initialization in Hadoop requires a large overhead, it is inefficient that each task processes a small amount of data. The second problem is the lack of support for techniques that optimize columnar execution. Although many such techniques have been proposed for efficient columnar query execution, existing column-store methods for Hadoop-based distributed systems cannot support them efficiently. In this paper, we propose a novel column-store method called SSFile for Hadoop-based distributed systems. SSFile increases the actual amount of data processed per task and supports representative columnar execution techniques for efficient query processing. Through extensive experiments, we show that SSFile significantly improves the performance of distributed processing.

KW - Column-store

KW - Distributed systems

KW - Hadoop

KW - HDFS

KW - Relational data analysis

KW - Server clusters

UR - http://www.scopus.com/inward/record.url?scp=84930060594&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84930060594&partnerID=8YFLogxK

U2 - 10.1016/j.ins.2015.04.014

DO - 10.1016/j.ins.2015.04.014

M3 - Article

VL - 316

SP - 68

EP - 86

JO - Information Sciences

JF - Information Sciences

SN - 0020-0255

ER -