Acquiring Korean lexical entry from a raw corpus

Wonhee Yu, Kinam Park, Soon Young Jung, Heui Seok Lim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper proposes a computational lexical entry acquisition model based on a representation model of the mental lexicon. The proposed model acquires lexical entries from a raw corpus by unsupervised learning, as humans do. The model is composed of a full-form acquisition module and a morpheme acquisition module. In the full-form acquisition module, core full-forms are automatically acquired according to frequency and recency thresholds. In the morpheme acquisition module, a substring that occurs repeatedly in different full-forms is chosen as a candidate morpheme. The candidate is then confirmed as a morpheme using an entropy measure over the syllables in the string. Experimental results on a Korean corpus of about 16 million full-forms show that the model successfully acquires major full-forms and morphemes with precision of 100% and 99.04%, respectively.
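The abstract does not give the exact thresholds or the exact entropy formula, so the following is only a minimal sketch of the two modules under stated assumptions. The function names, the parameters `min_freq`, `max_gap`, and `min_forms`, and the use of branching entropy (Shannon entropy of the syllable that follows a candidate) are all illustrative choices, not the authors' implementation. It also assumes the text is in precomposed (NFC) form, so that one Python character corresponds to one Hangul syllable.

```python
import math
from collections import Counter, defaultdict


def acquire_full_forms(tokens, min_freq=2, max_gap=1000):
    """Keep full-forms that are frequent and recently seen (assumed criteria).

    A full-form is kept if it occurs at least `min_freq` times and its last
    occurrence lies within `max_gap` tokens of the end of the stream.
    """
    freq = Counter()
    last_seen = {}
    for i, tok in enumerate(tokens):
        freq[tok] += 1
        last_seen[tok] = i
    n = len(tokens)
    return {
        t for t in freq
        if freq[t] >= min_freq and (n - 1 - last_seen[t]) <= max_gap
    }


def candidate_morphemes(full_forms, min_forms=2):
    """Map each substring found in >= `min_forms` distinct full-forms
    to the set of full-forms that contain it."""
    containing = defaultdict(set)
    for form in full_forms:
        for i in range(len(form)):
            for j in range(i + 1, len(form) + 1):
                containing[form[i:j]].add(form)
    return {s: f for s, f in containing.items() if len(f) >= min_forms}


def boundary_entropy(substring, full_forms):
    """Shannon entropy (in bits) of the syllable that follows `substring`.

    High entropy means many different continuations, which is taken here
    as evidence that the substring ends at a morpheme boundary.
    """
    following = Counter()
    for form in full_forms:
        start = form.find(substring)
        while start != -1:
            end = start + len(substring)
            following[form[end] if end < len(form) else "<END>"] += 1
            start = form.find(substring, start + 1)
    total = sum(following.values())
    return -sum((c / total) * math.log2(c / total) for c in following.values())
```

In this sketch, a candidate such as "학교" followed by several different particles would score a higher entropy than "학", which is almost always followed by "교"; a threshold on that score (again an assumption) would then decide which candidates are accepted as morphemes.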

Original language: English
Title of host publication: 2010 2nd International Conference on Information Technology Convergence and Services, ITCS 2010
DOIs: https://doi.org/10.1109/ITCS.2010.5581289
Publication status: Published - 2010 Nov 11
Event: 2010 2nd International Conference on Information Technology Convergence and Services, ITCS 2010 - Cebu, Philippines
Duration: 2010 Aug 11 → 2010 Aug 13

Other

Other: 2010 2nd International Conference on Information Technology Convergence and Services, ITCS 2010
Country: Philippines
City: Cebu
Period: 10/8/11 → 10/8/13

Keywords

  • Language learning
  • Lexical acquisition
  • Machine readable dictionary
  • Mental lexicon

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems

Cite this

Yu, W., Park, K., Jung, S. Y., & Lim, H. S. (2010). Acquiring Korean lexical entry from a raw corpus. In 2010 2nd International Conference on Information Technology Convergence and Services, ITCS 2010 [5581289] https://doi.org/10.1109/ITCS.2010.5581289