Design of HTML parallel parser with semantic-based input splitting

Jihyun Lee, Yeoul Na, Seon Wook Kim

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

HTML is a widely used markup language that underlies innumerable web pages. Parallelizing an HTML parser would lead to a significant performance improvement and a better user experience. However, parallelizing the HTML parser is challenging because of a strong cyclic dependence in the parser model. In this paper, we propose a semantic-based HTML parallel parser design that splits the input HTML document at 'div' tags and processes the independent partial inputs with multiple parser threads. We evaluated the proposed HTML parallel parser with benchmarks selected from the top 500 web pages and achieved a maximum speedup of 1.49x.
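The splitting idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `split_by_top_level_div`, `TagCollector`, `parse_chunk`, and `parallel_parse` are hypothetical, the splitter is a simple regex scan that assumes 'div' tags do not appear inside comments, scripts, or attribute values, and each worker only records start tags instead of building a DOM. Because of CPython's global interpreter lock, the threads here show the structure of the approach and are not expected to reproduce the reported speedup.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

# Matches opening and closing div tags; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def split_by_top_level_div(html):
    """Cut the input at the boundaries of top-level <div> elements."""
    chunks, depth, start = [], 0, 0
    for m in DIV_TAG.finditer(html):
        if m.group(1) != "/":              # opening <div>
            if depth == 0 and m.start() > start:
                chunks.append(html[start:m.start()])  # content before the div
                start = m.start()
            depth += 1
        elif depth > 0:                    # closing </div> with an open div
            depth -= 1
            if depth == 0:                 # a top-level div just closed
                chunks.append(html[start:m.end()])
                start = m.end()
    if start < len(html):
        chunks.append(html[start:])        # trailing content
    return chunks


class TagCollector(HTMLParser):
    """Stand-in for a real parser: records start tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_chunk(chunk):
    """Parse one partial input with its own parser instance."""
    parser = TagCollector()
    parser.feed(chunk)
    parser.close()
    return parser.tags


def parallel_parse(html, workers=4):
    """Parse each chunk on a worker thread and merge results in input order."""
    chunks = split_by_top_level_div(html)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(parse_chunk, chunks))  # map keeps chunk order
    return [tag for tags in results for tag in tags]
```

Each chunk is a complete top-level 'div' subtree or the content between two such subtrees, so a worker never needs parser state from another chunk. Concatenating the per-chunk results in input order gives the same tag sequence as a single sequential pass with `parse_chunk` over the whole document.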

Original language: English
Title of host publication: International Conference on Electronics, Information, and Communications, ICEIC 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781467380164
DOIs: https://doi.org/10.1109/ELINFOCOM.2016.7563004
Publication status: Published - 2016 Sep 7
Event: 15th International Conference on Electronics, Information, and Communications, ICEIC 2016 - Danang, Viet Nam
Duration: 2016 Jan 27 to 2016 Jan 30

Other

Other: 15th International Conference on Electronics, Information, and Communications, ICEIC 2016
Country: Viet Nam
City: Danang
Period: 16/1/27 to 16/1/30

Fingerprint

HTML
Semantics
Websites
Markup languages

Keywords

  • HTML
  • multithread
  • parallelizing

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Control and Systems Engineering

Cite this

Lee, J., Na, Y., & Kim, S. W. (2016). Design of HTML parallel parser with semantic-based input splitting. In International Conference on Electronics, Information, and Communications, ICEIC 2016 [7563004] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ELINFOCOM.2016.7563004

Design of HTML parallel parser with semantic-based input splitting. / Lee, Jihyun; Na, Yeoul; Kim, Seon Wook.

International Conference on Electronics, Information, and Communications, ICEIC 2016. Institute of Electrical and Electronics Engineers Inc., 2016. 7563004.

Lee, J, Na, Y & Kim, SW 2016, Design of HTML parallel parser with semantic-based input splitting. in International Conference on Electronics, Information, and Communications, ICEIC 2016., 7563004, Institute of Electrical and Electronics Engineers Inc., 15th International Conference on Electronics, Information, and Communications, ICEIC 2016, Danang, Viet Nam, 16/1/27. https://doi.org/10.1109/ELINFOCOM.2016.7563004
Lee J, Na Y, Kim SW. Design of HTML parallel parser with semantic-based input splitting. In International Conference on Electronics, Information, and Communications, ICEIC 2016. Institute of Electrical and Electronics Engineers Inc. 2016. 7563004 https://doi.org/10.1109/ELINFOCOM.2016.7563004
Lee, Jihyun ; Na, Yeoul ; Kim, Seon Wook. / Design of HTML parallel parser with semantic-based input splitting. International Conference on Electronics, Information, and Communications, ICEIC 2016. Institute of Electrical and Electronics Engineers Inc., 2016.
@inproceedings{6a6eeb6e125d4d9093ccb6c5faa9970d,
title = "Design of HTML parallel parser with semantic-based input splitting",
abstract = "HTML is a widely used markup language that underlies innumerable web pages. Parallelizing an HTML parser would lead to a significant performance improvement and a better user experience. However, parallelizing the HTML parser is challenging because of a strong cyclic dependence in the parser model. In this paper, we propose a semantic-based HTML parallel parser design that splits the input HTML document at 'div' tags and processes the independent partial inputs with multiple parser threads. We evaluated the proposed HTML parallel parser with benchmarks selected from the top 500 web pages and achieved a maximum speedup of 1.49x.",
keywords = "HTML, multithread, parallelizing",
author = "Lee, Jihyun and Na, Yeoul and Kim, {Seon Wook}",
year = "2016",
month = "9",
day = "7",
doi = "10.1109/ELINFOCOM.2016.7563004",
language = "English",
booktitle = "International Conference on Electronics, Information, and Communications, ICEIC 2016",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - Design of HTML parallel parser with semantic-based input splitting

AU - Lee, Jihyun

AU - Na, Yeoul

AU - Kim, Seon Wook

PY - 2016/9/7

Y1 - 2016/9/7

AB - HTML is a widely used markup language that underlies innumerable web pages. Parallelizing an HTML parser would lead to a significant performance improvement and a better user experience. However, parallelizing the HTML parser is challenging because of a strong cyclic dependence in the parser model. In this paper, we propose a semantic-based HTML parallel parser design that splits the input HTML document at 'div' tags and processes the independent partial inputs with multiple parser threads. We evaluated the proposed HTML parallel parser with benchmarks selected from the top 500 web pages and achieved a maximum speedup of 1.49x.

KW - HTML

KW - multithread

KW - parallelizing

UR - http://www.scopus.com/inward/record.url?scp=84988884713&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84988884713&partnerID=8YFLogxK

U2 - 10.1109/ELINFOCOM.2016.7563004

DO - 10.1109/ELINFOCOM.2016.7563004

M3 - Conference contribution

BT - International Conference on Electronics, Information, and Communications, ICEIC 2016

PB - Institute of Electrical and Electronics Engineers Inc.

ER -