Speech inversion: Benefits of tract variables over pellet trajectories

Vikramjit Mitra, Hosung Nam, Carol Y. Espy-Wilson, Elliot Saltzman, Louis Goldstein

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Citations (Scopus)

Abstract

Speech inversion estimates articulatory trajectories or vocal tract configurations from the acoustic speech signal. Traditionally, articulator flesh-point (pellet) trajectories have been used in speech-inversion research; however, because they are head-centered, task-neutral measures, they introduce additional variability into the inverse problem. This paper proposes using vocal tract constriction variables (TVs), which are less variable for speech inversion because they are constriction-based, task-specific measures. The TVs considered in this study comprise five constriction degree variables: lip aperture (LA), tongue body constriction degree (TBCD), tongue tip constriction degree (TTCD), velum (VEL), and glottis (GLO); and three constriction location variables: lip protrusion (LP), tongue tip constriction location (TTCL), and tongue body constriction location (TBCL). Six flesh-point trajectories were considered, measured with transducers placed on the upper lip (UL), the lower lip (LL), and four positions on the tongue (T1, T2, T3, and T4) between the tongue tip and the tongue dorsum. Speech inversion using a simple neural network architecture shows that the TVs can be estimated more accurately than the pellet trajectories. Further statistical investigation reveals that non-uniqueness is reduced in the TVs compared with the pellet trajectories for phones known to suffer appreciably from non-uniqueness. Finally, word recognition experiments show that the estimated TVs yield greater word recognition accuracy than the pellet trajectories in both clean and noisy speech, indicating that TVs are a better choice for speech recognition systems.

Original language: English
Title of host publication: 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings
Pages: 5188-5191
Number of pages: 4
DOI: https://doi.org/10.1109/ICASSP.2011.5947526
Publication status: Published - 2011 Aug 18
Externally published: Yes
Event: 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Prague, Czech Republic
Duration: 2011 May 22 to 2011 May 27

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Other

Other: 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011
Country: Czech Republic
City: Prague
Period: 2011 May 22 to 2011 May 27

Fingerprint

Trajectories
Network architecture
Speech recognition
Inverse problems
Transducers
Acoustics
Neural networks
Experiments

Keywords

  • Artificial Neural Networks
  • Non-uniqueness
  • Speech inversion
  • Tract variable time functions
  • Vocal tract constriction variables

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Mitra, V., Nam, H., Espy-Wilson, C. Y., Saltzman, E., & Goldstein, L. (2011). Speech inversion: Benefits of tract variables over pellet trajectories. In 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings (pp. 5188-5191). [5947526] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). https://doi.org/10.1109/ICASSP.2011.5947526

@inproceedings{a82d45fffe9049b58f4b29474929b260,
title = "Speech inversion: Benefits of tract variables over pellet trajectories",
abstract = "Speech inversion is a way of estimating articulatory trajectories or vocal tract configurations from the acoustic speech signal. Traditionally, articulator flesh-point or pellet trajectories have been used in speech-inversion research; however such information introduces additional variability into the inverse problem given they are head-centered, task-neutral measures. This paper proposes the use of vocal tract constriction variables (TVs) that are less variable for speech-inversion since they are constriction-based, task-specific measures. TVs considered in this study consist of five constriction degree variables, lip aperture (LA), tongue body constriction degree (TBCD), tongue tip constriction degree (TTCD), velum (VEL), and glottis (GLO); and three constriction location variables, lip protrusion (LP), tongue tip constriction location (TTCL) and tongue body constriction location (TBCL). Six different flesh-point trajectories were considered that were measured with transducers placed on the upper lip (UL), lower lip (LL) and four positions on the tongue (T1, T2, T3 and T4) between the tongue tip and the tongue dorsum. Speech inversion using a simple neural network architecture shows that the TVs can be estimated relatively more accurately than the pellet trajectories. Further statistical investigation reveals that the non-uniqueness is reduced in the TVs compared to the pellet trajectories for phones which are known to appreciably suffer from non-uniqueness. Finally we perform word recognition experiments using the estimated TVs as opposed to the pellet trajectories and show that the former offers greater word recognition accuracy both in clean and noisy speech, indicating that the TVs are a better choice for speech recognition systems.",
keywords = "Artificial Neural Networks, Non-uniqueness, Speech inversion, Tract variable time functions, Vocal tract constriction variables",
author = "Vikramjit Mitra and Hosung Nam and Espy-Wilson, {Carol Y.} and Elliot Saltzman and Louis Goldstein",
year = "2011",
month = "8",
day = "18",
doi = "10.1109/ICASSP.2011.5947526",
language = "English",
isbn = "9781457705397",
series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
pages = "5188--5191",
booktitle = "2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings",

}

TY - GEN

T1 - Speech inversion

T2 - Benefits of tract variables over pellet trajectories

AU - Mitra, Vikramjit

AU - Nam, Hosung

AU - Espy-Wilson, Carol Y.

AU - Saltzman, Elliot

AU - Goldstein, Louis

PY - 2011/8/18

Y1 - 2011/8/18

AB - Speech inversion is a way of estimating articulatory trajectories or vocal tract configurations from the acoustic speech signal. Traditionally, articulator flesh-point or pellet trajectories have been used in speech-inversion research; however such information introduces additional variability into the inverse problem given they are head-centered, task-neutral measures. This paper proposes the use of vocal tract constriction variables (TVs) that are less variable for speech-inversion since they are constriction-based, task-specific measures. TVs considered in this study consist of five constriction degree variables, lip aperture (LA), tongue body constriction degree (TBCD), tongue tip constriction degree (TTCD), velum (VEL), and glottis (GLO); and three constriction location variables, lip protrusion (LP), tongue tip constriction location (TTCL) and tongue body constriction location (TBCL). Six different flesh-point trajectories were considered that were measured with transducers placed on the upper lip (UL), lower lip (LL) and four positions on the tongue (T1, T2, T3 and T4) between the tongue tip and the tongue dorsum. Speech inversion using a simple neural network architecture shows that the TVs can be estimated relatively more accurately than the pellet trajectories. Further statistical investigation reveals that the non-uniqueness is reduced in the TVs compared to the pellet trajectories for phones which are known to appreciably suffer from non-uniqueness. Finally we perform word recognition experiments using the estimated TVs as opposed to the pellet trajectories and show that the former offers greater word recognition accuracy both in clean and noisy speech, indicating that the TVs are a better choice for speech recognition systems.

KW - Artificial Neural Networks

KW - Non-uniqueness

KW - Speech inversion

KW - Tract variable time functions

KW - Vocal tract constriction variables

UR - http://www.scopus.com/inward/record.url?scp=80051617129&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80051617129&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2011.5947526

DO - 10.1109/ICASSP.2011.5947526

M3 - Conference contribution

AN - SCOPUS:80051617129

SN - 9781457705397

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 5188

EP - 5191

BT - 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings

ER -