Examining the impact of data-access cost on XML twig pattern matching

Sang-Geun Lee, Byung Gul Ryu, Kun Lung Wu

Research output: Contribution to journal (Article)

3 Citations (Scopus)

Abstract

When large XML documents are processed, data-access time dominates overall system performance in most cases. However, few existing techniques optimize the data-access cost of twig pattern matching. TJFast [18] is one of the few that do. TJFast reduces the number of elements scanned because the extended Dewey label of an element alone suffices to derive all the element names along the path from the root to that element. However, there is still much room for improvement. We empirically observe that (1) TJFast still accesses and processes many irrelevant elements, unnecessarily incurring both data-access and computation overhead, and (2) substantial redundant label-to-element-name decoding remains, needlessly increasing processing cost. In this paper, we present TJFast-BNS, an optimization of TJFast, to further reduce the data-access cost of twig pattern matching. TJFast-BNS efficiently identifies and filters out many irrelevant elements by introducing a new labeling scheme, termed E2Dewey, and a novel pointer structure. E2Dewey includes the total number of children of an element in the element's label, which is used to quickly identify unnecessary paths. The pointer structure to the descendants of a branching element supports random access to leaf and non-top branching elements. Extensive performance studies on various datasets show that our approach accesses far fewer elements to process a twig query than other approaches, leading to a superior performance gain in execution time.
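The idea of deriving element names from a label alone can be illustrated with a small sketch. It is not taken from the paper: it assumes the commonly described extended Dewey convention in which each label component, taken modulo the number of child tags allowed for the parent, selects the child's tag. The table `CHILD_TAGS`, the tag names, and both function names are hypothetical, and the child-count component that E2Dewey adds is not modeled here.

```python
# Hypothetical child-tag table (e.g. derived from a DTD); an assumption for
# illustration only, not data from the paper.
CHILD_TAGS = {
    "bib": ["book", "article"],
    "book": ["title", "author", "chapter"],
    "chapter": ["title", "section"],
}


def decode_path(root_tag, label):
    """Return the element names from the root down to the labeled element.

    Assumes each label component, modulo the number of child tags of the
    current element, is the index of the next tag on the path.
    """
    path = [root_tag]
    for component in label:
        children = CHILD_TAGS[path[-1]]
        path.append(children[component % len(children)])
    return path


def is_ancestor(label_a, label_b):
    """True if label_a is a proper prefix of label_b (Dewey ancestor test)."""
    return len(label_a) < len(label_b) and label_b[:len(label_a)] == label_a


if __name__ == "__main__":
    print(decode_path("bib", (0, 5, 1)))
    print(is_ancestor((0,), (0, 5)))
```

Under these assumptions, the label `(0, 5, 1)` decodes to the path bib, book, chapter, section without reading any ancestor element, which is why a matcher of this kind only needs to scan the labels of leaf query nodes.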

Original language: English
Pages (from-to): 24-43
Number of pages: 20
Journal: Information Sciences
Volume: 203
DOIs: 10.1016/j.ins.2012.03.011
Publication status: Published - 2012 Oct 25

Fingerprint

Pattern matching
XML
Labels
Costs
Branching
Labeling
Decoding
Labeling Scheme
Path
Random Access
Execution Time
System Performance
Processing
Leaves
Optimise
Roots
Query
Filter
Optimization

Keywords

  • Branching node stream
  • Labeling scheme
  • Leaf node stream
  • Pointer structure
  • XML twig pattern matching

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Theoretical Computer Science
  • Computer Science Applications
  • Information Systems and Management

Cite this

Examining the impact of data-access cost on XML twig pattern matching. / Lee, Sang-Geun; Ryu, Byung Gul; Wu, Kun Lung.

In: Information Sciences, Vol. 203, 25.10.2012, p. 24-43.

Lee, Sang-Geun ; Ryu, Byung Gul ; Wu, Kun Lung. / Examining the impact of data-access cost on XML twig pattern matching. In: Information Sciences. 2012 ; Vol. 203. pp. 24-43.
@article{70ed3887dccd483b94a21cf947a401bc,
title = "Examining the impact of data-access cost on XML twig pattern matching",
keywords = "Branching node stream, Labeling scheme, Leaf node stream, Pointer structure, XML twig pattern matching",
author = "Lee, Sang-Geun and Ryu, {Byung Gul} and Wu, {Kun Lung}",
year = "2012",
month = "10",
day = "25",
doi = "10.1016/j.ins.2012.03.011",
language = "English",
volume = "203",
pages = "24--43",
journal = "Information Sciences",
issn = "0020-0255",
publisher = "Elsevier Inc.",

}

TY - JOUR

T1 - Examining the impact of data-access cost on XML twig pattern matching

AU - Lee, Sang-Geun

AU - Ryu, Byung Gul

AU - Wu, Kun Lung

PY - 2012/10/25

Y1 - 2012/10/25

KW - Branching node stream

KW - Labeling scheme

KW - Leaf node stream

KW - Pointer structure

KW - XML twig pattern matching

UR - http://www.scopus.com/inward/record.url?scp=84861097832&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84861097832&partnerID=8YFLogxK

U2 - 10.1016/j.ins.2012.03.011

DO - 10.1016/j.ins.2012.03.011

M3 - Article

AN - SCOPUS:84861097832

VL - 203

SP - 24

EP - 43

JO - Information Sciences

JF - Information Sciences

SN - 0020-0255

ER -