Smoothing algorithm for n-gram model using agglutinative characteristic of Korean

Jae Hyun Park, Young In Song, Hae Chang Rim

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

Smoothing for an n-gram language model is a technique that assigns a non-zero probability to unseen n-grams, and it is essential because of the data sparseness problem. In some circumstances, however, it assigns an improper amount of probability to unseen n-grams. In this paper, we present a novel method that adjusts these improperly assigned probabilities by taking advantage of the agglutinative characteristics of the Korean language. In Korean, the grammatically proper class of a morpheme can be predicted from the previous morpheme. Using this characteristic, we try to prevent grammatically improper n-grams from receiving relatively high probability and to assign more probability mass to proper n-grams. Experimental results show that the proposed method can achieve perplexity reductions of 8.6% to 12.5% for the Katz backoff algorithm and 4.9% to 7.0% for Kneser-Ney smoothing.
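The idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, and it uses neither Katz backoff nor Kneser-Ney: it is a toy bigram model with plain absolute discounting, in which the probability mass reserved for unseen successors is re-weighted by whether the next morpheme's class may grammatically follow the previous one. The romanized morphemes, the class tags N/J/V/E, the ALLOWED transition table, and the penalty and discount values are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus: sentences of (morpheme, class) pairs. The romanized morphemes and
# the class tags N/J/V/E are hypothetical, chosen only for illustration.
corpus = [
    [("hakgyo", "N"), ("e", "J"), ("ga", "V"), ("da", "E")],
    [("jip", "N"), ("e", "J"), ("o", "V"), ("da", "E")],
    [("bap", "N"), ("eul", "J"), ("meok", "V"), ("da", "E")],
]

# Assumed table of grammatically proper class transitions (previous, next).
ALLOWED = {("N", "J"), ("J", "V"), ("V", "E")}


def train(sentences):
    """Count unigrams and bigrams over (morpheme, class) tokens."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sent in sentences:
        unigrams.update(sent)
        for prev, nxt in zip(sent, sent[1:]):
            bigrams[prev][nxt] += 1
    return unigrams, bigrams


def bigram_prob(prev, nxt, unigrams, bigrams, allowed=None,
                penalty=0.1, discount=0.5):
    """P(nxt | prev) with absolute discounting and a class-weighted back-off.

    With allowed=None every unseen successor has weight 1.0, which gives an
    ordinary unigram back-off. With a transition table, successors whose class
    is not allowed after prev's class are down-weighted by `penalty`.
    """
    seen = bigrams.get(prev, Counter())
    total = sum(seen.values())
    if total == 0:
        # Unknown history: fall back to the unigram relative frequency.
        return unigrams[nxt] / sum(unigrams.values())
    if nxt in seen:
        return (seen[nxt] - discount) / total

    # Mass freed by discounting every seen successor of prev.
    reserved = discount * len(seen) / total

    def weight(w):
        if allowed is None or (prev[1], w[1]) in allowed:
            return 1.0
        return penalty

    # Normalize over unseen successors so the distribution still sums to 1.
    norm = sum(weight(w) * c for w, c in unigrams.items() if w not in seen)
    if norm == 0:
        return 0.0
    return reserved * weight(nxt) * unigrams[nxt] / norm


unigrams, bigrams = train(corpus)
prev = ("hakgyo", "N")
plain = {w: bigram_prob(prev, w, unigrams, bigrams) for w in unigrams}
aware = {w: bigram_prob(prev, w, unigrams, bigrams, allowed=ALLOWED)
         for w in unigrams}
```

In this toy setting, the unseen but grammatically proper successor ("eul", "J") receives more probability under `aware` than under `plain`, while an improper successor such as ("da", "E") receives less but stays non-zero. Both distributions still sum to one, because the weighting only redistributes the reserved back-off mass.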

Original language: English
Title of host publication: ICSC 2007 International Conference on Semantic Computing
Pages: 397-404
Number of pages: 8
DOIs: https://doi.org/10.1109/ICSC.2007.66
Publication status: Published - 2007
Event: ICSC 2007 International Conference on Semantic Computing - Irvine, CA, United States
Duration: 2007 Sep 17 → 2007 Sep 19

Publication series

Name: ICSC 2007 International Conference on Semantic Computing

Other

Other: ICSC 2007 International Conference on Semantic Computing
Country: United States
City: Irvine, CA
Period: 07/9/17 → 07/9/19

ASJC Scopus subject areas

  • Computer Science (all)
  • Computer Science Applications

Cite this

Park, J. H., Song, Y. I., & Rim, H. C. (2007). Smoothing algorithm for n-gram model using agglutinative characteristic of Korean. In ICSC 2007 International Conference on Semantic Computing (pp. 397-404). [4338374] (ICSC 2007 International Conference on Semantic Computing). https://doi.org/10.1109/ICSC.2007.66