Probabilistic models for automatic word spacing, based on hidden Markov models (HMMs) and using character n-grams (subsequences of n characters in a given character sequence), are discussed. Automatic word spacing is a preprocessing technique for correcting the boundaries between words in a sentence that contains spacing errors. The models can be applied effectively to a natural language with a small character set, such as English, by using character n-grams larger than trigrams. They are language independent: besides handling languages that use word spacing, they can also be used for word segmentation in languages without explicit word spacing. By generalizing HMMs, the models can consider a broad context and estimate accurate probabilities.
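As a rough illustration of the character n-gram definition above, the following Python sketch extracts character n-grams and recasts word spacing as a per-character tagging problem. The tag encoding (1 if a space follows the character, 0 otherwise) and the function names `char_ngrams`, `to_tagged`, and `from_tagged` are assumptions made for this example, not details taken from the abstract; the sketch covers only the data representation, not the HMM itself.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def to_tagged(sentence):
    """Split a spaced sentence into its characters and spacing tags.

    Assumed encoding: tag 1 means a space follows the character,
    tag 0 means no space follows it.
    """
    chars, tags = [], []
    for ch in sentence.strip():
        if ch == " ":
            if tags:
                tags[-1] = 1  # mark the previous character as word-final
        else:
            chars.append(ch)
            tags.append(0)
    return "".join(chars), tags


def from_tagged(chars, tags):
    """Rebuild a spaced sentence from characters and spacing tags."""
    out = []
    for ch, tag in zip(chars, tags):
        out.append(ch)
        if tag == 1:
            out.append(" ")
    return "".join(out).rstrip()


if __name__ == "__main__":
    print(char_ngrams("spacing", 3))
    chars, tags = to_tagged("the cat")
    print(chars, tags)
    print(from_tagged(chars, tags))
```

Under this representation, a spacing model would choose the tag sequence that is most probable given the character sequence, with each tag conditioned on a window of preceding characters and tags; a wider window corresponds to a larger n.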
ASJC Scopus subject areas
- Computer Networks and Communications
- Artificial Intelligence