Articulatory information for noise robust speech recognition

Vikramjit Mitra, Hosung Nam, Carol Y. Espy-Wilson, Elliot Saltzman, Louis Goldstein

Research output: Contribution to journal › Article

32 Citations (Scopus)

Abstract

Prior research has shown that articulatory information, if extracted properly from the speech signal, can improve the performance of automatic speech recognition systems. However, such information is not readily available in the signal. The challenge posed by the estimation of articulatory information from speech acoustics has led to a new line of research known as "acoustic-to-articulatory inversion" or "speech-inversion." While most of the research in this area has focused on estimating articulatory information more accurately, few have explored ways to apply this information in speech recognition tasks. In this paper, we first estimated articulatory information in the form of vocal tract constriction variables (abbreviated as TVs) from the Aurora-2 speech corpus using a neural network based speech-inversion model. Word recognition tasks were then performed for both noisy and clean speech using articulatory information in conjunction with traditional acoustic features. Our results indicate that incorporating TVs can significantly improve word recognition rates when used in conjunction with traditional acoustic features.
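The pipeline the abstract describes — train a network to regress acoustic frames to TV trajectories, then append the estimated TVs to the acoustic features before recognition — can be sketched minimally as below. This is an illustrative toy, not the authors' model: the data is synthetic, the network is a single-hidden-layer numpy MLP, and the dimensionalities (13 acoustic coefficients in, 8 TVs out, echoing the eight constriction variables of articulatory phonology) are assumptions for the example.

```python
# Sketch of acoustic-to-articulatory inversion followed by feature fusion.
# Synthetic data and a tiny MLP stand in for a real corpus and model.
import numpy as np

rng = np.random.default_rng(0)
N_FRAMES, N_ACOUSTIC, N_HIDDEN, N_TVS = 200, 13, 32, 8  # assumed sizes

# Synthetic stand-in: real systems train on data with known TV targets.
X = rng.standard_normal((N_FRAMES, N_ACOUSTIC))          # acoustic frames
W_true = rng.standard_normal((N_ACOUSTIC, N_TVS))
Y = np.tanh(X @ W_true)                                  # "ground-truth" TVs

# One-hidden-layer MLP trained by gradient descent on squared error.
W1 = rng.standard_normal((N_ACOUSTIC, N_HIDDEN)) * 0.1
W2 = rng.standard_normal((N_HIDDEN, N_TVS)) * 0.1
lr = 0.05

def forward(X):
    H = np.tanh(X @ W1)          # hidden activations
    return H, H @ W2             # linear output layer -> TV estimates

losses = []
for _ in range(300):
    H, Y_hat = forward(X)
    err = Y_hat - Y
    losses.append(float(np.mean(err ** 2)))
    gW2 = H.T @ err / N_FRAMES              # backprop: output layer
    gH = err @ W2.T * (1 - H ** 2)          # through the tanh
    gW1 = X.T @ gH / N_FRAMES               # input layer
    W1 -= lr * gW1
    W2 -= lr * gW2

_, tv_estimates = forward(X)
# Fuse: estimated TVs appended to the acoustic features per frame,
# giving the recognizer both streams, as the abstract describes.
fused = np.concatenate([X, tv_estimates], axis=1)  # shape (200, 21)
```

The fused feature matrix would then be fed to a word recognizer in place of the acoustic features alone; any gain comes from the TV stream carrying articulatory information the acoustic features encode only indirectly.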

Original language: English
Article number: 5677601
Pages (from-to): 1913-1924
Number of pages: 12
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 19
Issue number: 7
DOI: 10.1109/TASL.2010.2103058
Publication status: Published - 2011 Jul 25
Externally published: Yes


Keywords

  • Articulatory phonology
  • articulatory speech recognition
  • artificial neural networks (ANNs)
  • noise-robust speech recognition
  • speech inversion
  • task dynamic model
  • vocal-tract variables

ASJC Scopus subject areas

  • Acoustics and Ultrasonics
  • Electrical and Electronic Engineering

Cite this

Articulatory information for noise robust speech recognition. / Mitra, Vikramjit; Nam, Hosung; Espy-Wilson, Carol Y.; Saltzman, Elliot; Goldstein, Louis.

In: IEEE Transactions on Audio, Speech and Language Processing, Vol. 19, No. 7, 5677601, 25.07.2011, p. 1913-1924.

@article{27ad1632db204de3a1c057309b38cca5,
title = "Articulatory information for noise robust speech recognition",
abstract = "Prior research has shown that articulatory information, if extracted properly from the speech signal, can improve the performance of automatic speech recognition systems. However, such information is not readily available in the signal. The challenge posed by the estimation of articulatory information from speech acoustics has led to a new line of research known as {"}acoustic-to-articulatory inversion{"} or {"}speech-inversion.{"} While most of the research in this area has focused on estimating articulatory information more accurately, few have explored ways to apply this information in speech recognition tasks. In this paper, we first estimated articulatory information in the form of vocal tract constriction variables (abbreviated as TVs) from the Aurora-2 speech corpus using a neural network based speech-inversion model. Word recognition tasks were then performed for both noisy and clean speech using articulatory information in conjunction with traditional acoustic features. Our results indicate that incorporating TVs can significantly improve word recognition rates when used in conjunction with traditional acoustic features.",
keywords = "Articulatory phonology, articulatory speech recognition, artificial neural networks (ANNs), noise-robust speech recognition, speech inversion, task dynamic model, vocal-tract variables",
author = "Vikramjit Mitra and Hosung Nam and Espy-Wilson, {Carol Y.} and Elliot Saltzman and Louis Goldstein",
year = "2011",
month = "7",
day = "25",
doi = "10.1109/TASL.2010.2103058",
language = "English",
volume = "19",
pages = "1913--1924",
journal = "IEEE Transactions on Audio, Speech and Language Processing",
issn = "1558-7916",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "7",

}

TY - JOUR
T1 - Articulatory information for noise robust speech recognition
AU - Mitra, Vikramjit
AU - Nam, Hosung
AU - Espy-Wilson, Carol Y.
AU - Saltzman, Elliot
AU - Goldstein, Louis
PY - 2011/7/25
Y1 - 2011/7/25
N2 - Prior research has shown that articulatory information, if extracted properly from the speech signal, can improve the performance of automatic speech recognition systems. However, such information is not readily available in the signal. The challenge posed by the estimation of articulatory information from speech acoustics has led to a new line of research known as "acoustic-to-articulatory inversion" or "speech-inversion." While most of the research in this area has focused on estimating articulatory information more accurately, few have explored ways to apply this information in speech recognition tasks. In this paper, we first estimated articulatory information in the form of vocal tract constriction variables (abbreviated as TVs) from the Aurora-2 speech corpus using a neural network based speech-inversion model. Word recognition tasks were then performed for both noisy and clean speech using articulatory information in conjunction with traditional acoustic features. Our results indicate that incorporating TVs can significantly improve word recognition rates when used in conjunction with traditional acoustic features.
AB - Prior research has shown that articulatory information, if extracted properly from the speech signal, can improve the performance of automatic speech recognition systems. However, such information is not readily available in the signal. The challenge posed by the estimation of articulatory information from speech acoustics has led to a new line of research known as "acoustic-to-articulatory inversion" or "speech-inversion." While most of the research in this area has focused on estimating articulatory information more accurately, few have explored ways to apply this information in speech recognition tasks. In this paper, we first estimated articulatory information in the form of vocal tract constriction variables (abbreviated as TVs) from the Aurora-2 speech corpus using a neural network based speech-inversion model. Word recognition tasks were then performed for both noisy and clean speech using articulatory information in conjunction with traditional acoustic features. Our results indicate that incorporating TVs can significantly improve word recognition rates when used in conjunction with traditional acoustic features.
KW - Articulatory phonology
KW - articulatory speech recognition
KW - artificial neural networks (ANNs)
KW - noise-robust speech recognition
KW - speech inversion
KW - task dynamic model
KW - vocal-tract variables
UR - http://www.scopus.com/inward/record.url?scp=79960545035&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79960545035&partnerID=8YFLogxK
U2 - 10.1109/TASL.2010.2103058
DO - 10.1109/TASL.2010.2103058
M3 - Article
AN - SCOPUS:79960545035
VL - 19
SP - 1913
EP - 1924
JO - IEEE Transactions on Audio, Speech and Language Processing
JF - IEEE Transactions on Audio, Speech and Language Processing
SN - 1558-7916
IS - 7
M1 - 5677601
ER -