Robust word recognition using articulatory trajectories and gestures

Vikramjit Mitra, Hosung Nam, Carol Espy-Wilson, Elliot Saltzman, Louis Goldstein

Research output: Contribution to conference › Paper

8 Citations (Scopus)

Abstract

Articulatory Phonology views speech as an ensemble of constriction events, or gestures (e.g., narrowing the lips, raising the tongue tip), at distinct organs along the vocal tract (lips, tongue tip, tongue body, velum, and glottis). This study shows that articulatory information, in the form of gestures and their output trajectories (tract-variable time functions, or TVs), can improve the performance of automatic speech recognition systems. Because no natural speech database contains such articulatory information, we used a synthetic speech dataset (obtained from the Haskins Laboratories TAsk Dynamic model of speech production) that pairs the acoustic waveform of each utterance with its corresponding gestures and TVs. First, we propose neural-network-based models that recognize the gestures and estimate the TVs from acoustic information. Second, we apply these synthetic-data-trained articulatory models to the natural speech utterances in the Aurora-2 corpus to estimate their gestures and TVs. Finally, we show that the estimated articulatory information improves the noise robustness of a word recognition system when used alongside cepstral features.
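The abstract describes a two-stage pipeline: a neural network estimates tract variables (TVs) from acoustic features, and the estimated TVs are then concatenated with cepstral features before word recognition. The sketch below illustrates only that data flow; it is not the authors' implementation. The dimensions (13 cepstral coefficients, a 7-frame context window, 8 TVs) and the untrained random weights are illustrative assumptions — in the paper, the inversion model would be trained on the synthetic TADA data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (illustrative, not taken from the paper):
# 13 cepstral coefficients per frame, a 7-frame context window,
# and 8 tract variables (TVs) estimated per frame.
N_CEP, CONTEXT, N_TV = 13, 7, 8
HIDDEN = 64

def init_mlp(n_in, n_hidden, n_out, rng):
    """Random (untrained) feedforward net with one tanh hidden layer."""
    return {
        "W1": rng.standard_normal((n_in, n_hidden)) * 0.1,
        "b1": np.zeros(n_hidden),
        "W2": rng.standard_normal((n_hidden, n_out)) * 0.1,
        "b2": np.zeros(n_out),
    }

def mlp_forward(params, x):
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def estimate_tvs(cepstra, params):
    """Map each frame's context window of cepstra to TV estimates."""
    n_frames = cepstra.shape[0]
    pad = CONTEXT // 2
    padded = np.pad(cepstra, ((pad, pad), (0, 0)), mode="edge")
    windows = np.stack(
        [padded[t:t + CONTEXT].ravel() for t in range(n_frames)]
    )
    return mlp_forward(params, windows)  # shape: (n_frames, N_TV)

# Stage 1: "speech inversion" net (weights random here; the paper's
# models would be trained on the synthetic TADA dataset).
inverter = init_mlp(N_CEP * CONTEXT, HIDDEN, N_TV, rng)

# Stand-in utterance: 100 frames of cepstral features.
cepstra = rng.standard_normal((100, N_CEP))
tvs = estimate_tvs(cepstra, inverter)

# Stage 2: augment the cepstral front end with the estimated TVs
# before handing the combined features to the word recognizer.
features = np.concatenate([cepstra, tvs], axis=1)
print(features.shape)  # (100, 21): 13 cepstra + 8 estimated TVs
```

The edge-padding at utterance boundaries is one common convention for context windows; it is a design choice here, not something specified in the abstract.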

Original language: English
Pages: 2038-2041
Number of pages: 4
Publication status: Published - 2010 Dec 1
Externally published: Yes
Event: 11th Annual Conference of the International Speech Communication Association: Spoken Language Processing for All, INTERSPEECH 2010 - Makuhari, Chiba, Japan
Duration: 2010 Sep 26 - 2010 Sep 30



Keywords

  • Articulatory phonology
  • Noise robust speech recognition
  • Speech gestures
  • Speech inversion
  • TADA model neural networks
  • Tract variables

ASJC Scopus subject areas

  • Language and Linguistics
  • Speech and Hearing

Cite this

Mitra, V., Nam, H., Espy-Wilson, C., Saltzman, E., & Goldstein, L. (2010). Robust word recognition using articulatory trajectories and gestures (pp. 2038-2041). Paper presented at the 11th Annual Conference of the International Speech Communication Association: Spoken Language Processing for All, INTERSPEECH 2010, Makuhari, Chiba, Japan.


Scopus record: http://www.scopus.com/inward/record.url?scp=79959813685&partnerID=8YFLogxK (AN: SCOPUS:79959813685)