Noise robustness of tract variables and their application to speech recognition

Vikramjit Mitra, Hosung Nam, Carol Espy-Wilson, Elliot Saltzman, Louis Goldstein

Research output: Contribution to journalConference article

7 Citations (Scopus)

Abstract

This paper analyzes the noise robustness of vocal tract constriction variable estimation and investigates their role for noise robust speech recognition. We implemented a simple direct inverse model using a feed-forward artificial neural network to estimate vocal tract variables (TVs) from the speech signal. Initially, we trained the model on clean synthetic speech and then test the noise robustness of the model on noise-corrupted speech. The training corpus was obtained from the TAsk Dynamics Application model (TADA [1]), which generated the synthetic speech as well as their corresponding TVs. Eight different vocal tract constriction variables consisting of five constriction degree variables (lip aperture [LA], tongue body [TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]); three constriction location variables (lip protrusion [LP], tongue tip [TTCL], tongue body [TBCL]) were considered in this study. We also explored using a modified phase opponency (MPO) [2] speech enhancement technique as the preprocessor for TV estimation to observe its effect upon noise robustness. Kalman smoothing was applied to the estimated TVs to reduce the estimation noise. Finally the TV estimation module was tested using a naturally-produced speech that is contaminated with noise at different signal-to-noise ratios. The estimated TVs from the natural speech corpus are then used in conjunction with the baseline features to perform automatic speech recognition (ASR) experiments. Results show an average 22% and 21% improvement, relative to the baseline, on ASR performance using the Aurora-2 dataset with car and subway noise, respectively. The TVs in these experiments are estimated from the MPO-enhanced speech.

Original languageEnglish
Pages (from-to)2759-2762
Number of pages4
JournalProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication statusPublished - 2009 Nov 26
Externally publishedYes
Event10th Annual Conference of the International Speech Communication Association, INTERSPEECH 2009 - Brighton, United Kingdom
Duration: 2009 Sep 62009 Sep 10

Fingerprint

Speech recognition
Noise
Tongue
Constriction
Acoustic noise
Lip
Speech enhancement
Subways
Railroads
Glottis
Signal to noise ratio
Railroad cars
Experiments
Signal-To-Noise Ratio
Neural networks

Keywords

  • Gestural phonology
  • Kalman smoothing
  • Neural networks
  • Noise robust speech recognition
  • Speech inversion

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Sensory Systems

Cite this

Noise robustness of tract variables and their application to speech recognition. / Mitra, Vikramjit; Nam, Hosung; Espy-Wilson, Carol; Saltzman, Elliot; Goldstein, Louis.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 26.11.2009, p. 2759-2762.

Research output: Contribution to journalConference article

@article{041df63c94c347a98454a3d88348a943,
title = "Noise robustness of tract variables and their application to speech recognition",
abstract = "This paper analyzes the noise robustness of vocal tract constriction variable estimation and investigates their role for noise robust speech recognition. We implemented a simple direct inverse model using a feed-forward artificial neural network to estimate vocal tract variables (TVs) from the speech signal. Initially, we trained the model on clean synthetic speech and then test the noise robustness of the model on noise-corrupted speech. The training corpus was obtained from the TAsk Dynamics Application model (TADA [1]), which generated the synthetic speech as well as their corresponding TVs. Eight different vocal tract constriction variables consisting of five constriction degree variables (lip aperture [LA], tongue body [TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]); three constriction location variables (lip protrusion [LP], tongue tip [TTCL], tongue body [TBCL]) were considered in this study. We also explored using a modified phase opponency (MPO) [2] speech enhancement technique as the preprocessor for TV estimation to observe its effect upon noise robustness. Kalman smoothing was applied to the estimated TVs to reduce the estimation noise. Finally the TV estimation module was tested using a naturally-produced speech that is contaminated with noise at different signal-to-noise ratios. The estimated TVs from the natural speech corpus are then used in conjunction with the baseline features to perform automatic speech recognition (ASR) experiments. Results show an average 22{\%} and 21{\%} improvement, relative to the baseline, on ASR performance using the Aurora-2 dataset with car and subway noise, respectively. The TVs in these experiments are estimated from the MPO-enhanced speech.",
keywords = "Gestural phonology, Kalman smoothing, Neural networks, Noise robust speech recognition, Speech inversion",
author = "Vikramjit Mitra and Hosung Nam and Carol Espy-Wilson and Elliot Saltzman and Louis Goldstein",
year = "2009",
month = "11",
day = "26",
language = "English",
pages = "2759--2762",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}

TY - JOUR

T1 - Noise robustness of tract variables and their application to speech recognition

AU - Mitra, Vikramjit

AU - Nam, Hosung

AU - Espy-Wilson, Carol

AU - Saltzman, Elliot

AU - Goldstein, Louis

PY - 2009/11/26

Y1 - 2009/11/26

N2 - This paper analyzes the noise robustness of vocal tract constriction variable estimation and investigates their role for noise robust speech recognition. We implemented a simple direct inverse model using a feed-forward artificial neural network to estimate vocal tract variables (TVs) from the speech signal. Initially, we trained the model on clean synthetic speech and then test the noise robustness of the model on noise-corrupted speech. The training corpus was obtained from the TAsk Dynamics Application model (TADA [1]), which generated the synthetic speech as well as their corresponding TVs. Eight different vocal tract constriction variables consisting of five constriction degree variables (lip aperture [LA], tongue body [TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]); three constriction location variables (lip protrusion [LP], tongue tip [TTCL], tongue body [TBCL]) were considered in this study. We also explored using a modified phase opponency (MPO) [2] speech enhancement technique as the preprocessor for TV estimation to observe its effect upon noise robustness. Kalman smoothing was applied to the estimated TVs to reduce the estimation noise. Finally the TV estimation module was tested using a naturally-produced speech that is contaminated with noise at different signal-to-noise ratios. The estimated TVs from the natural speech corpus are then used in conjunction with the baseline features to perform automatic speech recognition (ASR) experiments. Results show an average 22% and 21% improvement, relative to the baseline, on ASR performance using the Aurora-2 dataset with car and subway noise, respectively. The TVs in these experiments are estimated from the MPO-enhanced speech.

AB - This paper analyzes the noise robustness of vocal tract constriction variable estimation and investigates their role for noise robust speech recognition. We implemented a simple direct inverse model using a feed-forward artificial neural network to estimate vocal tract variables (TVs) from the speech signal. Initially, we trained the model on clean synthetic speech and then test the noise robustness of the model on noise-corrupted speech. The training corpus was obtained from the TAsk Dynamics Application model (TADA [1]), which generated the synthetic speech as well as their corresponding TVs. Eight different vocal tract constriction variables consisting of five constriction degree variables (lip aperture [LA], tongue body [TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]); three constriction location variables (lip protrusion [LP], tongue tip [TTCL], tongue body [TBCL]) were considered in this study. We also explored using a modified phase opponency (MPO) [2] speech enhancement technique as the preprocessor for TV estimation to observe its effect upon noise robustness. Kalman smoothing was applied to the estimated TVs to reduce the estimation noise. Finally the TV estimation module was tested using a naturally-produced speech that is contaminated with noise at different signal-to-noise ratios. The estimated TVs from the natural speech corpus are then used in conjunction with the baseline features to perform automatic speech recognition (ASR) experiments. Results show an average 22% and 21% improvement, relative to the baseline, on ASR performance using the Aurora-2 dataset with car and subway noise, respectively. The TVs in these experiments are estimated from the MPO-enhanced speech.

KW - Gestural phonology

KW - Kalman smoothing

KW - Neural networks

KW - Noise robust speech recognition

KW - Speech inversion

UR - http://www.scopus.com/inward/record.url?scp=70450200298&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70450200298&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:70450200298

SP - 2759

EP - 2762

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -