Robust speech recognition using articulatory gestures in a dynamic Bayesian network framework

Vikramjit Mitra, Hosung Nam, Carol Y. Espy-Wilson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

Articulatory Phonology models speech as a spatio-temporal constellation of constricting events (e.g., raising of the tongue tip, narrowing of the lips), known as articulatory gestures. These gestures are associated with distinct organs (lips, tongue tip, tongue body, velum, and glottis) along the vocal tract. In this paper we present a Dynamic Bayesian Network based speech recognition architecture that models articulatory gestures as hidden variables and uses them for speech recognition. Using the proposed architecture we performed (a) word recognition experiments on the noisy data of Aurora-2 and (b) phone recognition experiments on the University of Wisconsin X-ray microbeam database. Our results indicate that the use of gestural information improves recognition performance compared to a system using acoustic information only.
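The abstract describes a recognizer in which articulatory gestures act as hidden variables of a dynamic Bayesian network. As a purely illustrative sketch (not the authors' model — their network structure, gesture inventory, and probabilities are defined in the paper itself), the simplest such DBN is an HMM-like chain: a hidden gesture state evolves over time, emits acoustic observations, and the forward algorithm yields the sequence likelihood a recognizer would compare across word hypotheses. Every state, symbol, and probability below is invented for illustration:

```python
# Hypothetical two-state gesture chain with three discrete acoustic symbols.
pi = [0.6, 0.4]                        # P(gesture at t=0): [lip_closure, lip_open]
A = [[0.7, 0.3],                       # P(gesture_t | gesture_{t-1})
     [0.2, 0.8]]
B = [[0.6, 0.3, 0.1],                  # P(acoustic symbol | gesture):
     [0.1, 0.2, 0.7]]                  # symbols 0=silence, 1=burst, 2=vowel

def forward_likelihood(obs):
    """Forward algorithm: P(obs_1..T), marginalizing over the hidden gestures."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]      # initialize at t=1
    for o in obs[1:]:                                     # recurse over t=2..T
        alpha = [sum(alpha[sp] * A[sp][s] for sp in range(n)) * B[s][o]
                 for s in range(n)]
    return sum(alpha)                                     # sum out final gesture

# Example: score the acoustic sequence silence -> burst -> vowel.
likelihood = forward_likelihood([0, 1, 2])
```

In a full recognizer, each word hypothesis would define its own gesture-level model, and the hypothesis with the highest likelihood for the observed acoustics would win; the paper's contribution lies in using gestural rather than purely acoustic hidden structure.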

Original language: English
Title of host publication: 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings
Pages: 131-136
Number of pages: 6
DOI: 10.1109/ASRU.2011.6163918
ISBN: 9781467303675
Publication status: Published - 2011 Dec 1
Externally published: Yes
Event: 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011 - Waikoloa, HI, United States
Duration: 2011 Dec 11 - 2011 Dec 15

Publication series

Name: 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings

Conference

Conference: 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011
Country: United States
City: Waikoloa, HI
Period: 11/12/11 - 11/12/15

Fingerprint

  • Bayesian networks
  • Speech recognition
  • Acoustics
  • Experiments
  • X rays

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction

Cite this

Mitra, V., Nam, H., & Espy-Wilson, C. Y. (2011). Robust speech recognition using articulatory gestures in a dynamic Bayesian network framework. In 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings (pp. 131-136). [6163918] (2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings). https://doi.org/10.1109/ASRU.2011.6163918

@inproceedings{f46da2780b984e298a04928f4e948e1f,
title = "Robust speech recognition using articulatory gestures in a dynamic Bayesian network framework",
abstract = "Articulatory Phonology models speech as spatio-temporal constellation of constricting events (e.g. raising tongue tip, narrowing lips etc.), known as articulatory gestures. These gestures are associated with distinct organs (lips, tongue tip, tongue body, velum and glottis) along the vocal tract. In this paper we present a Dynamic Bayesian Network based speech recognition architecture that models the articulatory gestures as hidden variables and uses them for speech recognition. Using the proposed architecture we performed: (a) word recognition experiments on the noisy data of Aurora-2 and (b) phone recognition experiments on the University of Wisconsin X-ray microbeam database. Our results indicate that the use of gestural information helps to improve the performance of the recognition system compared to the system using acoustic information only.",
author = "Vikramjit Mitra and Hosung Nam and Espy-Wilson, {Carol Y.}",
year = "2011",
month = "12",
day = "1",
doi = "10.1109/ASRU.2011.6163918",
language = "English",
isbn = "9781467303675",
series = "2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings",
pages = "131--136",
booktitle = "2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings",

}
