Studies have shown that supplementary articulatory information can help to improve the recognition rate of automatic speech recognition systems. Unfortunately, articulatory information is not directly observable, necessitating its estimation from the speech signal. This study describes a system that recognizes articulatory gestures from speech, and uses the recognized gestures in a speech recognition system. Recognizing gestures for a given utterance involves recovering the set of underlying gestural activations and their associated dynamic parameters. This paper proposes a neural network architecture for recognizing articulatory gestures from speech and presents ways to incorporate articulatory gestures for a digit recognition task. The lack of natural speech database containing gestural information prompted us to use three stages of evaluation. First, the proposed gestural annotation architecture was tested on a synthetic speech dataset, which showed that the use of estimated tract-variable-time-functions improved gesture recognition performance. In the second stage, gesture-recognition models were applied to natural speech waveforms and word recognition experiments revealed that the recognized gestures can improve the noise-robustness of a word recognition system. In the final stage, a gesture-based Dynamic Bayesian Network was trained and the results indicate that incorporating gestural information can improve word recognition performance compared to acoustic-only systems.
ASJC Scopus subject areas
- Arts and Humanities (miscellaneous)
- Acoustics and Ultrasonics