Hybrid convolutional neural networks for articulatory and acoustic information based speech recognition

Vikramjit Mitra, Ganesh Sivaraman, Hosung Nam, Carol Espy-Wilson, Elliot Saltzman, Mark Tiede

Research output: Contribution to journal › Article

30 Citations (Scopus)

Abstract

Studies have shown that articulatory information helps model speech variability and, consequently, improves speech recognition performance. But learning speaker-invariant articulatory models is challenging, as speaker-specific signatures in both the articulatory and acoustic spaces increase the complexity of the speech-to-articulatory mapping, which is already an ill-posed problem due to its inherent nonlinearity and non-unique nature. This work explores using deep neural networks (DNNs) and convolutional neural networks (CNNs) to map speech data into its corresponding articulatory space. Our speech-inversion results indicate that the CNN models perform better than their DNN counterparts. In addition, we use these inverse models to generate articulatory information from speech for two separate speech recognition tasks: the WSJ1 and Aurora-4 continuous speech recognition tasks. This work proposes a hybrid convolutional neural network (HCNN), in which two parallel layers are used to jointly model the acoustic and articulatory spaces, and the decisions from the parallel layers are fused at the output context-dependent (CD) state level. The acoustic model performs time-frequency convolution on filterbank-energy-level features, whereas the articulatory model performs time convolution on the articulatory features. The performance of the proposed architecture is compared to that of CNN- and DNN-based systems using gammatone filterbank energies as acoustic features, and the results indicate that the HCNN-based model yields lower word error rates than the CNN/DNN baseline systems.
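
The architecture described above can be pictured as two parallel convolutional branches whose outputs are fused before the CD-state output layer. Below is a minimal sketch of such a two-branch network, written in PyTorch purely for illustration; the layer sizes, context window, pooling, and the concatenation-based fusion are assumptions made for the sketch and are not taken from the paper.

import torch
import torch.nn as nn

class HybridCNN(nn.Module):
    """Illustrative two-branch hybrid CNN: time-frequency convolution on
    filterbank energies, time convolution on articulatory trajectories,
    with the branch outputs fused before the CD-state output layer.
    All hyperparameters are assumed, not the authors' configuration."""
    def __init__(self, n_filterbank=40, n_frames=11, n_articulatory=8, n_cd_states=2000):
        super().__init__()
        # Acoustic branch: 2-D (time-frequency) convolution over filterbank energies.
        self.acoustic = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(8, 5)),   # kernel spans 8 bands x 5 frames
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(3, 1)),        # pool along frequency only
            nn.Flatten(),
        )
        # Articulatory branch: 1-D convolution along time only.
        self.articulatory = nn.Sequential(
            nn.Conv1d(n_articulatory, 32, kernel_size=5),
            nn.ReLU(),
            nn.Flatten(),
        )
        acoustic_dim = 32 * ((n_filterbank - 8 + 1) // 3) * (n_frames - 5 + 1)
        artic_dim = 32 * (n_frames - 5 + 1)
        # Fusion: concatenate the branch outputs and map to CD-state posteriors.
        self.classifier = nn.Sequential(
            nn.Linear(acoustic_dim + artic_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_cd_states),
        )

    def forward(self, filterbank, articulatory):
        # filterbank: (batch, 1, n_filterbank, n_frames)
        # articulatory: (batch, n_articulatory, n_frames)
        a = self.acoustic(filterbank)
        b = self.articulatory(articulatory)
        return self.classifier(torch.cat([a, b], dim=1))  # logits over CD states

The paper fuses decisions at the output CD-state level; the concatenation-plus-shared-layers scheme above is just one simple way to realize such a fusion, not necessarily the one used by the authors.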

Original language: English
Pages (from-to): 103-112
Number of pages: 10
Journal: Speech Communication
Volume: 89
DOI: 10.1016/j.specom.2017.03.003
Publication status: Published - 1 May 2017


Keywords

  • Articulatory trajectories
  • Automatic speech recognition
  • Convolutional neural networks
  • Hybrid convolutional neural networks
  • Time-frequency convolution
  • Vocal tract variables

ASJC Scopus subject areas

  • Software
  • Language and Linguistics
  • Modelling and Simulation
  • Communication
  • Linguistics and Language
  • Computer Vision and Pattern Recognition
  • Computer Science Applications

Cite this

Hybrid convolutional neural networks for articulatory and acoustic information based speech recognition. / Mitra, Vikramjit; Sivaraman, Ganesh; Nam, Hosung; Espy-Wilson, Carol; Saltzman, Elliot; Tiede, Mark.

In: Speech Communication, Vol. 89, 01.05.2017, pp. 103-112.

Research output: Contribution to journal › Article

Mitra, Vikramjit; Sivaraman, Ganesh; Nam, Hosung; Espy-Wilson, Carol; Saltzman, Elliot; Tiede, Mark. Hybrid convolutional neural networks for articulatory and acoustic information based speech recognition. In: Speech Communication. 2017; Vol. 89, pp. 103-112.
@article{6b6a25ca2a3649cead3f5fb644e938c6,
title = "Hybrid convolutional neural networks for articulatory and acoustic information based speech recognition",
abstract = "Studies have shown that articulatory information helps model speech variability and, consequently, improves speech recognition performance. But learning speaker-invariant articulatory models is challenging, as speaker-specific signatures in both the articulatory and acoustic spaces increase the complexity of the speech-to-articulatory mapping, which is already an ill-posed problem due to its inherent nonlinearity and non-unique nature. This work explores using deep neural networks (DNNs) and convolutional neural networks (CNNs) to map speech data into its corresponding articulatory space. Our speech-inversion results indicate that the CNN models perform better than their DNN counterparts. In addition, we use these inverse models to generate articulatory information from speech for two separate speech recognition tasks: the WSJ1 and Aurora-4 continuous speech recognition tasks. This work proposes a hybrid convolutional neural network (HCNN), in which two parallel layers are used to jointly model the acoustic and articulatory spaces, and the decisions from the parallel layers are fused at the output context-dependent (CD) state level. The acoustic model performs time-frequency convolution on filterbank-energy-level features, whereas the articulatory model performs time convolution on the articulatory features. The performance of the proposed architecture is compared to that of CNN- and DNN-based systems using gammatone filterbank energies as acoustic features, and the results indicate that the HCNN-based model yields lower word error rates than the CNN/DNN baseline systems.",
keywords = "Articulatory trajectories, Automatic speech recognition, Convolutional neural networks, Hybrid convolutional neural networks, Time-frequency convolution, Vocal tract variables",
author = "Vikramjit Mitra and Ganesh Sivaraman and Hosung Nam and Carol Espy-Wilson and Elliot Saltzman and Mark Tiede",
year = "2017",
month = "5",
day = "1",
doi = "10.1016/j.specom.2017.03.003",
language = "English",
volume = "89",
pages = "103--112",
journal = "Speech Communication",
issn = "0167-6393",
publisher = "Elsevier",

}

TY - JOUR

T1 - Hybrid convolutional neural networks for articulatory and acoustic information based speech recognition

AU - Mitra, Vikramjit

AU - Sivaraman, Ganesh

AU - Nam, Hosung

AU - Espy-Wilson, Carol

AU - Saltzman, Elliot

AU - Tiede, Mark

PY - 2017/5/1

Y1 - 2017/5/1

N2 - Studies have shown that articulatory information helps model speech variability and, consequently, improves speech recognition performance. But learning speaker-invariant articulatory models is challenging, as speaker-specific signatures in both the articulatory and acoustic spaces increase the complexity of the speech-to-articulatory mapping, which is already an ill-posed problem due to its inherent nonlinearity and non-unique nature. This work explores using deep neural networks (DNNs) and convolutional neural networks (CNNs) to map speech data into its corresponding articulatory space. Our speech-inversion results indicate that the CNN models perform better than their DNN counterparts. In addition, we use these inverse models to generate articulatory information from speech for two separate speech recognition tasks: the WSJ1 and Aurora-4 continuous speech recognition tasks. This work proposes a hybrid convolutional neural network (HCNN), in which two parallel layers are used to jointly model the acoustic and articulatory spaces, and the decisions from the parallel layers are fused at the output context-dependent (CD) state level. The acoustic model performs time-frequency convolution on filterbank-energy-level features, whereas the articulatory model performs time convolution on the articulatory features. The performance of the proposed architecture is compared to that of CNN- and DNN-based systems using gammatone filterbank energies as acoustic features, and the results indicate that the HCNN-based model yields lower word error rates than the CNN/DNN baseline systems.

AB - Studies have shown that articulatory information helps model speech variability and, consequently, improves speech recognition performance. But learning speaker-invariant articulatory models is challenging, as speaker-specific signatures in both the articulatory and acoustic spaces increase the complexity of the speech-to-articulatory mapping, which is already an ill-posed problem due to its inherent nonlinearity and non-unique nature. This work explores using deep neural networks (DNNs) and convolutional neural networks (CNNs) to map speech data into its corresponding articulatory space. Our speech-inversion results indicate that the CNN models perform better than their DNN counterparts. In addition, we use these inverse models to generate articulatory information from speech for two separate speech recognition tasks: the WSJ1 and Aurora-4 continuous speech recognition tasks. This work proposes a hybrid convolutional neural network (HCNN), in which two parallel layers are used to jointly model the acoustic and articulatory spaces, and the decisions from the parallel layers are fused at the output context-dependent (CD) state level. The acoustic model performs time-frequency convolution on filterbank-energy-level features, whereas the articulatory model performs time convolution on the articulatory features. The performance of the proposed architecture is compared to that of CNN- and DNN-based systems using gammatone filterbank energies as acoustic features, and the results indicate that the HCNN-based model yields lower word error rates than the CNN/DNN baseline systems.

KW - Articulatory trajectories

KW - Automatic speech recognition

KW - Convolutional neural networks

KW - Hybrid convolutional neural networks

KW - Time-frequency convolution

KW - Vocal tract variables

UR - http://www.scopus.com/inward/record.url?scp=85017178582&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85017178582&partnerID=8YFLogxK

U2 - 10.1016/j.specom.2017.03.003

DO - 10.1016/j.specom.2017.03.003

M3 - Article

AN - SCOPUS:85017178582

VL - 89

SP - 103

EP - 112

JO - Speech Communication

JF - Speech Communication

SN - 0167-6393

ER -