Automatic generation of structured hyperdocuments from document images

Ji Yeon Lee, Jeong Seon Park, Hyeran Byun, Jongsub Moon, Seong Whan Lee

Research output: Contribution to journal > Article

2 Citations (Scopus)

Abstract

As document sharing over the World Wide Web continues to grow, so does the need to create hyperdocuments, in formats such as HTML and SGML/XML, that make documents accessible and retrievable via the Internet. Nevertheless, little work has been done on converting paper documents into hyperdocuments, and most existing studies have concentrated on the direct conversion of single-column document images that contain only text and image objects. In this paper, we propose two methods for converting complex multi-column document images into HTML documents, and a method for generating a structured table of contents page based on logical structure analysis of the document image. Experiments with various kinds of multi-column document images show that the proposed methods generate HTML documents with the same visual layout as the document images, and produce a structured table of contents page whose hierarchically ordered section titles are hyperlinked to the contents.

Original language: English
Pages (from-to): 485-503
Number of pages: 19
Journal: Pattern Recognition
Volume: 35
Issue number: 2
DOI: 10.1016/S0031-3203(01)00026-7
Publication status: Published - 2002 Feb 1

Keywords

  • Document conversion
  • Document image understanding
  • Logical structure analysis
  • Multi-column document
  • Structured hyperdocument

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Automatic generation of structured hyperdocuments from document images. / Lee, Ji Yeon; Park, Jeong Seon; Byun, Hyeran; Moon, Jongsub; Lee, Seong Whan.

In: Pattern Recognition, Vol. 35, No. 2, 01.02.2002, p. 485-503.

@article{bec97cc2fbb144dc959a24e5aed3100d,
title = "Automatic generation of structured hyperdocuments from document images",
keywords = "Document conversion, Document image understanding, Logical structure analysis, Multi-column document, Structured hyperdocument",
author = "Lee, {Ji Yeon} and Park, {Jeong Seon} and Hyeran Byun and Jongsub Moon and Lee, {Seong Whan}",
year = "2002",
month = "2",
day = "1",
doi = "10.1016/S0031-3203(01)00026-7",
language = "English",
volume = "35",
pages = "485--503",
journal = "Pattern Recognition",
issn = "0031-3203",
publisher = "Elsevier Limited",
number = "2",
}

TY - JOUR
T1 - Automatic generation of structured hyperdocuments from document images
AU - Lee, Ji Yeon
AU - Park, Jeong Seon
AU - Byun, Hyeran
AU - Moon, Jongsub
AU - Lee, Seong Whan
PY - 2002/2/1
Y1 - 2002/2/1
AB - As sharing documents through the World Wide Web has been recently and constantly increasing, the need for creating hyperdocuments to make them accessible and retrievable via the internet, in formats such as HTML and SGML/XML, has also been rapidly rising. Nevertheless, only a few works have been done on the conversion of paper documents into hyperdocuments. Moreover, most of these studies have concentrated on the direct conversion of single-column document images that include only text and image objects. In this paper, we propose two methods for converting complex multi-column document images into HTML documents, and a method for generating a structured table of contents page based on the logical structure analysis of the document image. Experiments with various kinds of multi-column document images show that, by using the proposed methods, their corresponding HTML documents can be generated in the same visual layout as that of the document images, and their structured table of contents page can be also produced with the hierarchically ordered section titles hyperlinked to the contents.
KW - Document conversion
KW - Document image understanding
KW - Logical structure analysis
KW - Multi-column document
KW - Structured hyperdocument
UR - http://www.scopus.com/inward/record.url?scp=0036467005&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0036467005&partnerID=8YFLogxK
U2 - 10.1016/S0031-3203(01)00026-7
DO - 10.1016/S0031-3203(01)00026-7
M3 - Article
VL - 35
SP - 485
EP - 503
JO - Pattern Recognition
JF - Pattern Recognition
SN - 0031-3203
IS - 2
ER -