Parameter-free geometric document layout analysis

Seong Whan Lee, Dae Seok Ryu

Research output: Contribution to journal > Article

56 Citations (Scopus)

Abstract

Automatic conversion of paper documents into electronic documents requires geometric document layout analysis as its first stage. However, variations in character font sizes, text line spacing, and document layout structures have long made it difficult to design a general-purpose document layout analysis algorithm, so previous methods have unavoidably relied on some parameters. In this paper, we propose a parameter-free method for segmenting document images into maximal homogeneous regions and identifying them as texts, images, tables, and ruling lines. A pyramidal quadtree structure is constructed for multiscale analysis, and a periodicity measure is proposed to detect the periodic attribute of text regions for page segmentation. To obtain robust page segmentation results, a confirmation procedure using texture analysis is applied only to ambiguous regions. Based on the proposed periodicity measure, multiscale analysis, and confirmation procedure, we developed a robust method for geometric document layout analysis that is independent of character font sizes, text line spacing, and document layout structures. The proposed method was evaluated on the document database from the University of Washington and the MediaTeam Document Database. The results show that the proposed method is more accurate than previous methods.
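The abstract does not define the periodicity measure, so the following is only an illustrative sketch of the general idea that text regions show a periodic pattern of lines and gaps. It is not the authors' measure: it assumes a binary image given as a list of rows (1 = black pixel) and uses a normalised autocorrelation of the horizontal projection profile. All function names, and the `max_lag` search bound, are hypothetical choices made for this example.

```python
def projection_profile(image):
    """Count black pixels (1s) in each row of a binary image."""
    return [sum(row) for row in image]


def autocorrelation(profile, lag):
    """Mean-removed autocorrelation at `lag`, normalised by the lag-0 energy."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]
    energy = sum(c * c for c in centered)
    if energy == 0:
        # A flat profile (e.g. blank or solid region) has no periodic structure.
        return 0.0
    return sum(centered[i] * centered[i + lag] for i in range(n - lag)) / energy


def dominant_period(profile, max_lag=None):
    """Return (lag, score) for the lag >= 2 with the highest autocorrelation.

    Returns (0, 0.0) when no lag has a positive score.
    `max_lag` defaults to half the profile length (an assumption of this sketch).
    """
    n = len(profile)
    if max_lag is None:
        max_lag = n // 2
    best_lag, best_score = 0, 0.0
    for lag in range(2, max_lag + 1):
        score = autocorrelation(profile, lag)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


# Synthetic "text block": 3 black rows (a text line) then 2 white rows (a gap),
# repeated 6 times, 8 pixels wide.
text_block = ([[1] * 8] * 3 + [[0] * 8] * 2) * 6
period, score = dominant_period(projection_profile(text_block))
print(period, round(score, 3))
```

A region with a strong score at some lag would be a candidate text region whose line pitch equals that lag, while a uniform region yields no period at all. A real system would still need a rule for deciding what counts as "strong", which is exactly the kind of parameter the paper aims to avoid.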

Original language: English
Pages (from-to): 1240-1256
Number of pages: 17
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 23
Issue number: 11
DOIs: 10.1109/34.969115
Publication status: Published - 2001 Nov 1

Fingerprint

Layout
Multiscale Analysis
Textures
Periodicity
Segmentation
Texture Analysis
Quadtree
Algorithm Analysis
Robust Methods
Ambiguous
Databases
Tables
Attribute
Electronics
Text
Line
Character

Keywords

  • Geometric document layout analysis
  • Multiscale analysis
  • Page segmentation
  • Parameter-free method
  • Periodicity estimation

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Electrical and Electronic Engineering
  • Artificial Intelligence
  • Computer Vision and Pattern Recognition

Cite this

Parameter-free geometric document layout analysis. / Lee, Seong Whan; Ryu, Dae Seok.

In: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 11, 01.11.2001, p. 1240-1256.

@article{7cce87a564f1473db744c5f37eea2128,
title = "Parameter-free geometric document layout analysis",
abstract = "Automatic transformation of paper documents into electronic documents requires geometric document layout analysis at the first stage. However, variations in character font sizes, text line spacing, and document layout structures have made it difficult to design a general-purpose document layout analysis algorithm for many years. The use of some parameters has therefore been unavoidable in previous methods. In this paper, we propose a parameter-free method for segmenting the document images into maximal homogeneous regions and identifying them as texts, images, tables, and ruling lines. A pyramidal quadtree structure is constructed for multiscale analysis and a periodicity measure is suggested to find a periodical attribute of text regions for page segmentation. To obtain robust page segmentation results, a confirmation procedure using texture analysis is applied to only ambiguous regions. Based on the proposed periodicity measure, multiscale analysis, and confirmation procedure, we could develop a robust method for geometric document layout analysis independent of character font sizes, text line spacing, and document layout structures. The proposed method was experimented with the document database from the University of Washington and the MediaTeam Document Database. The results of these tests have shown that the proposed method provides more accurate results than the previous ones.",
keywords = "Geometric document layout analysis, Multiscale analysis, Page segmentation, Parameter-free method, Periodicity estimation",
author = "Lee, {Seong Whan} and Ryu, {Dae Seok}",
year = "2001",
month = "11",
day = "1",
doi = "10.1109/34.969115",
language = "English",
volume = "23",
pages = "1240--1256",
journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
issn = "0162-8828",
publisher = "IEEE Computer Society",
number = "11",

}

TY  - JOUR
T1  - Parameter-free geometric document layout analysis
AU  - Lee, Seong Whan
AU  - Ryu, Dae Seok
PY  - 2001/11/1
Y1  - 2001/11/1
AB  - Automatic transformation of paper documents into electronic documents requires geometric document layout analysis at the first stage. However, variations in character font sizes, text line spacing, and document layout structures have made it difficult to design a general-purpose document layout analysis algorithm for many years. The use of some parameters has therefore been unavoidable in previous methods. In this paper, we propose a parameter-free method for segmenting the document images into maximal homogeneous regions and identifying them as texts, images, tables, and ruling lines. A pyramidal quadtree structure is constructed for multiscale analysis and a periodicity measure is suggested to find a periodical attribute of text regions for page segmentation. To obtain robust page segmentation results, a confirmation procedure using texture analysis is applied to only ambiguous regions. Based on the proposed periodicity measure, multiscale analysis, and confirmation procedure, we could develop a robust method for geometric document layout analysis independent of character font sizes, text line spacing, and document layout structures. The proposed method was experimented with the document database from the University of Washington and the MediaTeam Document Database. The results of these tests have shown that the proposed method provides more accurate results than the previous ones.
KW  - Geometric document layout analysis
KW  - Multiscale analysis
KW  - Page segmentation
KW  - Parameter-free method
KW  - Periodicity estimation
UR  - http://www.scopus.com/inward/record.url?scp=0035510433&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=0035510433&partnerID=8YFLogxK
U2  - 10.1109/34.969115
DO  - 10.1109/34.969115
M3  - Article
VL  - 23
SP  - 1240
EP  - 1256
JO  - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF  - IEEE Transactions on Pattern Analysis and Machine Intelligence
SN  - 0162-8828
IS  - 11
ER  - 