Real-time continuous phoneme recognition system using class-dependent tied-mixture HMM with HBT structure for speech-driven lip-sync

Junho Park, Hanseok Ko

Research output: Contribution to journal › Article

8 Citations (Scopus)

Abstract

This work describes a real-time lip-sync method in which an avatar's lip shape is synchronized with the corresponding speech signal. Phoneme recognition is generally regarded as an important task in the operation of a real-time lip-sync system. In this work, the Head-Body-Tail (HBT) model is proposed for more efficiently recognizing phonemes whose realizations vary due to co-articulation effects. The HBT model effectively handles the transition parts of context-dependent models for small-vocabulary tasks, and such models provide better recognition performance than general context-dependent or context-independent models on digit and vowel recognition tasks. Moreover, each phoneme is categorized into one of four classes, and a class-dependent codebook is generated to further improve performance. Additionally, to clearly represent the context-dependency information in the transient parts, some Gaussians are excluded from the class-dependent codebook. The proposed method yields a lip-sync system that performs at a level similar to previous designs based on HBT and continuous hidden Markov models (CHMMs), while reducing the number of model parameters by one-third and enabling real-time operation.
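The core idea of a class-dependent tied-mixture HMM, as summarized in the abstract, is that all states belonging to the same phoneme class share one codebook of Gaussian components, while each state keeps only its own mixture weights. The following sketch is illustrative only (the class names, codebook contents, and function names are assumptions, not taken from the paper); it shows how sharing a codebook shrinks the parameter count relative to a full CHMM, where every state would store its own Gaussians.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density (one dimension, for illustration)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# One shared codebook per (hypothetical) phoneme class: a list of
# (mean, variance) component Gaussians reused by every state in that class.
codebooks = {
    "vowel": [(0.0, 1.0), (2.0, 0.5)],
    "nasal": [(-1.0, 0.8), (1.5, 1.2)],
}

def state_likelihood(obs, weights, class_name):
    """Tied-mixture observation likelihood:
    b_j(o) = sum_m c_{jm} * N(o; mu_m, var_m), summed over the class codebook."""
    return sum(w * gaussian_pdf(obs, m, v)
               for w, (m, v) in zip(weights, codebooks[class_name]))

# Two vowel-class states share the same Gaussians and differ only in their
# mixture weights -- the source of the parameter savings described above.
b1 = state_likelihood(0.1, [0.7, 0.3], "vowel")
b2 = state_likelihood(0.1, [0.2, 0.8], "vowel")
```

Here `b1` exceeds `b2` because the first state puts more weight on the component whose mean (0.0) lies near the observation (0.1); each state contributes only a weight vector, not a new set of Gaussian parameters.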

Original language: English
Article number: 4668497
Pages (from-to): 1299-1306
Number of pages: 8
Journal: IEEE Transactions on Multimedia
Volume: 10
Issue number: 7
DOIs
Publication status: Published - Nov. 2008

Keywords

  • Head-body-tail HMM
  • Phoneme recognition
  • Real-time lip-sync

ASJC Scopus subject areas

  • Signal Processing
  • Media Technology
  • Computer Science Applications
  • Electrical and Electronic Engineering

