Gesture-based dynamic Bayesian network for noise robust speech recognition

Vikramjit Mitra, Hosung Nam, Carol Y. Espy-Wilson, Elliot Saltzman, Louis Goldstein

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

15 Citations (Scopus)

Abstract

Previously, we proposed models for estimating articulatory gestures and vocal tract variable (TV) trajectories from synthetic speech, and we showed that, when deployed on natural speech, such models can improve the noise robustness of a hidden Markov model (HMM) based speech recognition system. In this paper we propose a model for estimating TVs that is trained on natural speech, and we present a Dynamic Bayesian Network (DBN) based speech recognition architecture that treats vocal tract constriction gestures as hidden variables, eliminating the need for explicit gesture recognition. Using the proposed architecture, we performed a word recognition task on the noisy Aurora-2 data. Using gestural information as hidden variables in the DBN architecture yielded a significant improvement over back ends, whether HMM or DBN, that use only mel-frequency cepstral coefficients. We also compare our results with other noise-robust front ends.
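To make the idea of treating gestures as hidden variables concrete, below is a minimal sketch (not the authors' implementation) of a forward pass over a factored DBN state in Python/NumPy: a discrete phonetic state and a discrete gesture variable evolve jointly, the gesture is never observed, and it is marginalized out frame by frame. All dimensions, probability tables, and variable names are illustrative placeholders, not the Aurora-2 configuration described in the paper.

# Illustrative sketch only: forward pass over a factored DBN state (s, g),
# where s is a phonetic state and g is a hidden "gesture" variable that is
# marginalized out at every frame instead of being explicitly recognized.
# All probability tables below are random placeholders, not trained models.
import numpy as np

n_states, n_gestures, n_frames = 3, 2, 5
rng = np.random.default_rng(0)

# Transition model over the factored state: P(s_t, g_t | s_{t-1}, g_{t-1}).
A = rng.random((n_states, n_gestures, n_states, n_gestures))
A /= A.sum(axis=(2, 3), keepdims=True)

# Per-frame observation likelihoods P(o_t | s_t, g_t); in a real system these
# would come from acoustic models over MFCC and estimated TV features.
B = rng.random((n_frames, n_states, n_gestures))

# Uniform prior over the initial factored state (s_0, g_0).
alpha = np.full((n_states, n_gestures), 1.0 / (n_states * n_gestures)) * B[0]
scale = alpha.sum()
log_evidence = np.log(scale)
alpha /= scale

for t in range(1, n_frames):
    # Sum over the previous (state, gesture) pair, then reweight by the
    # current observation likelihood.
    alpha = np.einsum('ij,ijkl->kl', alpha, A) * B[t]
    scale = alpha.sum()          # running normalizer for numerical stability
    log_evidence += np.log(scale)
    alpha /= scale

print("log-likelihood of the (toy) utterance:", log_evidence)

In an actual system the transition and observation distributions would be trained, the gesture variable would range over vocal-tract constriction gestures rather than a binary toy value, and decoding would recover the most likely word sequence rather than a single utterance likelihood.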

Original language: English
Title of host publication: 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings
Pages: 5172-5175
Number of pages: 4
DOI: 10.1109/ICASSP.2011.5947522
Publication status: Published - 2011 Aug 18
Externally published: Yes
Event: 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Prague, Czech Republic
Duration: 2011 May 22 – 2011 May 27

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149
ISBN (Print): 9781457705397

Other

Other: 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011
Country: Czech Republic
City: Prague
Period: 2011 May 22 – 2011 May 27

Keywords

  • Articulatory Phonology
  • Articulatory Speech Recognition
  • Dynamic Bayesian Network
  • Noise-robust Speech Recognition
  • Task Dynamic model
  • Vocal-Tract variables

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Mitra, V., Nam, H., Espy-Wilson, C. Y., Saltzman, E., & Goldstein, L. (2011). Gesture-based dynamic Bayesian network for noise robust speech recognition. In 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings (pp. 5172-5175). [5947522] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). https://doi.org/10.1109/ICASSP.2011.5947522
