TY - GEN
T1 - Robust word recognition using articulatory trajectories and gestures
AU - Mitra, Vikramjit
AU - Nam, Hosung
AU - Espy-Wilson, Carol
AU - Saltzman, Elliot
AU - Goldstein, Louis
N1 - Funding Information:
This research was supported by NSF Grant # IIS0703859, IIS0703048, and IIS0703782
PY - 2010
Y1 - 2010
AB - Articulatory Phonology views speech as an ensemble of constriction events (gestures), e.g., narrowing the lips or raising the tongue tip, at distinct organs (lips, tongue tip, tongue body, velum, and glottis) along the vocal tract. This study shows that articulatory information in the form of gestures and their output trajectories (tract variable time functions, or TVs) can help to improve the performance of automatic speech recognition systems. The lack of any natural speech database containing such articulatory information prompted us to use a synthetic speech dataset (obtained from the Haskins Laboratories TAsk Dynamic (TADA) model of speech production) that contains the acoustic waveform for a given utterance along with its corresponding gestures and TVs. First, we propose neural-network-based models to recognize the gestures and estimate the TVs from acoustic information. Second, the synthetic-data-trained articulatory models were applied to the natural speech utterances in the Aurora-2 corpus to estimate their gestures and TVs. Finally, we show that the estimated articulatory information helps to improve the noise robustness of a word recognition system when used along with cepstral features.
KW - Articulatory phonology
KW - Noise robust speech recognition
KW - Speech gestures
KW - Speech inversion
KW - TADA model
KW - Neural networks
KW - Tract variables
UR - http://www.scopus.com/inward/record.url?scp=79959813685&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:79959813685
T3 - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
SP - 2038
EP - 2041
BT - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
PB - International Speech Communication Association
ER -