A novel word segmentation approach for written languages with word boundary markers

Han Cheol Cho, Do Gil Lee, Jung Tae Lee, Pontus Stenetorp, Jun'ichi Tsujii, Hae-Chang Rim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Most NLP applications assume that user input is error-free; thus, word segmentation (WS) for written languages that use word boundary markers (WBMs), such as spaces, has been regarded as a trivial issue. However, noisy real-world texts, such as blogs, e-mails, and SMS messages, may contain spacing errors that require correction before further processing can take place. For the Korean language, many researchers have adopted a traditional WS approach, which eliminates all spaces in the user input and re-inserts proper word boundaries. Unfortunately, because no perfect WS model exists, such an approach often degrades the word spacing quality of user input that has few or no spacing errors. In this paper, we propose a novel WS method that takes into consideration the initial word spacing information of the user input. Our method generates better output than the original user input, even if the user input has few spacing errors. Moreover, the proposed method significantly outperforms a state-of-the-art Korean WS model when the user input initially contains less than 10% spacing errors, and performs comparably when the input contains more spacing errors. We believe that the proposed method will be a very practical pre-processing module.
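
The abstract does not describe the model itself, so the following is only a toy sketch of the contrast it draws, not the authors' method. It assumes a hypothetical per-position scorer (`toy_scorer`) that returns the probability of a word boundary after each character, and an invented `margin` parameter that makes the segmenter reluctant to overrule the spacing the user typed. English text is used purely for readability.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: probability of a boundary after chars[i].
BoundaryScorer = Callable[[str, int], float]


def strip_spaces(text: str) -> Tuple[str, List[bool]]:
    """Return the text without spaces, plus one flag per remaining
    character saying whether the user typed a space right after it."""
    chars: List[str] = []
    spaced: List[bool] = []
    for ch in text:
        if ch == " ":
            if chars:
                spaced[-1] = True
        else:
            chars.append(ch)
            spaced.append(False)
    if spaced:
        spaced[-1] = False  # a trailing space is not a word boundary
    return "".join(chars), spaced


def join(chars: str, decisions: List[bool]) -> str:
    """Re-insert a space after every position marked True (except the last)."""
    out: List[str] = []
    for i, ch in enumerate(chars):
        out.append(ch)
        if decisions[i] and i < len(chars) - 1:
            out.append(" ")
    return "".join(out)


def segment_traditional(text: str, scorer: BoundaryScorer) -> str:
    """Discard the user's spacing and trust the model alone."""
    chars, _ = strip_spaces(text)
    return join(chars, [scorer(chars, i) >= 0.5 for i in range(len(chars))])


def segment_with_user_spacing(text: str, scorer: BoundaryScorer,
                              margin: float = 0.3) -> str:
    """Keep the user's choice at each position unless the model
    disagrees by more than `margin` (an assumed, illustrative rule)."""
    chars, spaced = strip_spaces(text)
    decisions = []
    for i in range(len(chars)):
        # Lower the bar where the user typed a space, raise it elsewhere.
        threshold = 0.5 - margin if spaced[i] else 0.5 + margin
        decisions.append(scorer(chars, i) >= threshold)
    return join(chars, decisions)


def toy_scorer(chars: str, i: int) -> float:
    """Deliberately imperfect stand-in for a trained WS model, hard-coded
    for the string 'thecatsat' (it prefers 'the cats at')."""
    return {2: 0.9, 5: 0.4, 6: 0.6}.get(i, 0.05)


print(segment_traditional("the cat sat", toy_scorer))        # the cats at
print(segment_with_user_spacing("the cat sat", toy_scorer))  # the cat sat
print(segment_with_user_spacing("thecat sat", toy_scorer))   # the cat sat
```

In this toy setting the space-discarding segmenter damages an input that was already correct, while the variant that consults the original spacing leaves the correct input intact and still repairs the one missing space.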

Original language: English
Title of host publication: ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf.
Pages: 29-32
Number of pages: 4
Publication status: Published - 2009 Dec 1
Event: Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL-IJCNLP 2009 - Suntec, Singapore
Duration: 2009 Aug 2 → 2009 Aug 7

Other

Other: Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL-IJCNLP 2009
Country: Singapore
City: Suntec
Period: 09/8/2 → 09/8/7

Fingerprint

written language
SMS
segmentation
Spacing
Novel Words
Word Segmentation
Written Language
e-mail
weblog
language

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Cho, H. C., Lee, D. G., Lee, J. T., Stenetorp, P., Tsujii, J., & Rim, H-C. (2009). A novel word segmentation approach for written languages with word boundary markers. In ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf. (pp. 29-32)

A novel word segmentation approach for written languages with word boundary markers. / Cho, Han Cheol; Lee, Do Gil; Lee, Jung Tae; Stenetorp, Pontus; Tsujii, Jun'ichi; Rim, Hae-Chang.

ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf. 2009. p. 29-32.

Cho, HC, Lee, DG, Lee, JT, Stenetorp, P, Tsujii, J & Rim, H-C 2009, A novel word segmentation approach for written languages with word boundary markers. in ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf. pp. 29-32, Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL-IJCNLP 2009, Suntec, Singapore, 09/8/2.
Cho HC, Lee DG, Lee JT, Stenetorp P, Tsujii J, Rim H-C. A novel word segmentation approach for written languages with word boundary markers. In ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf. 2009. p. 29-32
Cho, Han Cheol ; Lee, Do Gil ; Lee, Jung Tae ; Stenetorp, Pontus ; Tsujii, Jun'ichi ; Rim, Hae-Chang. / A novel word segmentation approach for written languages with word boundary markers. ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf. 2009. pp. 29-32
@inproceedings{1314639a3c1c46abb4fd59fb9f52a437,
title = "A novel word segmentation approach for written languages with word boundary markers",
abstract = "Most NLP applications work under the assumption that a user input is error-free; thus, word segmentation (WS) for written languages that use word boundary markers (WBMs), such as spaces, has been regarded as a trivial issue. However, noisy real-world texts, such as blogs, e-mails, and SMS, may contain spacing errors that require correction before further processing may take place. For the Korean language, many researchers have adopted a traditional WS approach, which eliminates all spaces in the user input and re-inserts proper word boundaries. Unfortunately, such an approach often exacerbates the word spacing quality for user input, which has few or no spacing errors; such is the case, because a perfect WS model does not exist. In this paper, we propose a novel WS method that takes into consideration the initial word spacing information of the user input. Our method generates a better output than the original user input, even if the user input has few spacing errors. Moreover, the proposed method significantly outperforms a state-of-the-art Korean WS model when the user input initially contains less than 10{\%} spacing errors, and performs comparably for cases containing more spacing errors. We believe that the proposed method will be a very practical pre-processing module.",
author = "Cho, {Han Cheol} and Lee, {Do Gil} and Lee, {Jung Tae} and Pontus Stenetorp and Jun'ichi Tsujii and Hae-Chang Rim",
year = "2009",
month = "12",
day = "1",
language = "English",
isbn = "9781617382581",
pages = "29--32",
booktitle = "ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf.",

}

TY - GEN

T1 - A novel word segmentation approach for written languages with word boundary markers

AU - Cho, Han Cheol

AU - Lee, Do Gil

AU - Lee, Jung Tae

AU - Stenetorp, Pontus

AU - Tsujii, Jun'ichi

AU - Rim, Hae-Chang

PY - 2009/12/1

Y1 - 2009/12/1

N2 - Most NLP applications work under the assumption that a user input is error-free; thus, word segmentation (WS) for written languages that use word boundary markers (WBMs), such as spaces, has been regarded as a trivial issue. However, noisy real-world texts, such as blogs, e-mails, and SMS, may contain spacing errors that require correction before further processing may take place. For the Korean language, many researchers have adopted a traditional WS approach, which eliminates all spaces in the user input and re-inserts proper word boundaries. Unfortunately, such an approach often exacerbates the word spacing quality for user input, which has few or no spacing errors; such is the case, because a perfect WS model does not exist. In this paper, we propose a novel WS method that takes into consideration the initial word spacing information of the user input. Our method generates a better output than the original user input, even if the user input has few spacing errors. Moreover, the proposed method significantly outperforms a state-of-the-art Korean WS model when the user input initially contains less than 10% spacing errors, and performs comparably for cases containing more spacing errors. We believe that the proposed method will be a very practical pre-processing module.

UR - http://www.scopus.com/inward/record.url?scp=84859908122&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84859908122&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84859908122

SN - 9781617382581

SP - 29

EP - 32

BT - ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf.

ER -