Noise robustness of tract variables and their application to speech recognition

Vikramjit Mitra, Hosung Nam, Carol Espy-Wilson, Elliot Saltzman, Louis Goldstein

Research output: Contribution to journal, Conference article

7 Citations (Scopus)

Abstract

This paper analyzes the noise robustness of vocal tract constriction variable estimation and investigates the role of these variables in noise-robust speech recognition. We implemented a simple direct inverse model using a feed-forward artificial neural network to estimate vocal tract variables (TVs) from the speech signal. Initially, we trained the model on clean synthetic speech and then tested its noise robustness on noise-corrupted speech. The training corpus was obtained from the TAsk Dynamics Application model (TADA [1]), which generated the synthetic speech as well as its corresponding TVs. Eight vocal tract constriction variables were considered in this study: five constriction degree variables (lip aperture [LA], tongue body [TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]) and three constriction location variables (lip protrusion [LP], tongue tip [TTCL], and tongue body [TBCL]). We also explored using a modified phase opponency (MPO) [2] speech enhancement technique as a preprocessor for TV estimation to observe its effect on noise robustness. Kalman smoothing was applied to the estimated TVs to reduce the estimation noise. Finally, the TV estimation module was tested on naturally produced speech contaminated with noise at different signal-to-noise ratios. The TVs estimated from the natural speech corpus were then used in conjunction with the baseline features to perform automatic speech recognition (ASR) experiments. Results show average improvements in ASR performance of 22% and 21%, relative to the baseline, on the Aurora-2 dataset with car and subway noise, respectively. The TVs in these experiments were estimated from the MPO-enhanced speech.
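The Kalman smoothing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the state-space model or its parameters, so everything here is an assumption made for illustration: a scalar random-walk state observed in white noise, the function name `kalman_smooth`, and the variances `q` and `r`. It is not the authors' implementation. It runs a forward Kalman filter followed by a Rauch-Tung-Striebel backward pass over a single estimated TV trajectory.

```python
import numpy as np


def kalman_smooth(y, q=1e-3, r=4e-2):
    """Smooth a 1-D trajectory with a Kalman filter plus RTS backward pass.

    Assumed model (hypothetical, not taken from the paper):
        x[t] = x[t-1] + w,  w ~ N(0, q)   (random-walk state)
        y[t] = x[t]   + v,  v ~ N(0, r)   (noisy TV estimate)
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    x_pred = np.empty(n)  # one-step-ahead state predictions
    p_pred = np.empty(n)  # their variances
    x_filt = np.empty(n)  # filtered state estimates
    p_filt = np.empty(n)  # their variances

    # Initialise from the first observation.
    x_pred[0], p_pred[0] = y[0], r
    x_filt[0], p_filt[0] = y[0], r

    # Forward pass: predict, then correct with each new observation.
    for t in range(1, n):
        x_pred[t] = x_filt[t - 1]
        p_pred[t] = p_filt[t - 1] + q
        gain = p_pred[t] / (p_pred[t] + r)
        x_filt[t] = x_pred[t] + gain * (y[t] - x_pred[t])
        p_filt[t] = (1.0 - gain) * p_pred[t]

    # Backward pass: refine each estimate using later observations.
    x_smooth = x_filt.copy()
    for t in range(n - 2, -1, -1):
        g = p_filt[t] / p_pred[t + 1]
        x_smooth[t] = x_filt[t] + g * (x_smooth[t + 1] - x_pred[t + 1])
    return x_smooth
```

In this sketch each of the eight TVs would be smoothed independently by calling `kalman_smooth` on its trajectory. A larger `q` relative to `r` tracks fast articulatory movement more closely, at the cost of less noise suppression.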

Original language: English
Pages (from-to): 2759-2762
Number of pages: 4
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2009 Nov 26
Externally published: Yes
Event: 10th Annual Conference of the International Speech Communication Association, INTERSPEECH 2009 - Brighton, United Kingdom
Duration: 2009 Sep 6 to 2009 Sep 10

Keywords

  • Gestural phonology
  • Kalman smoothing
  • Neural networks
  • Noise robust speech recognition
  • Speech inversion

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Sensory Systems
