Vocal tract length normalization for speaker independent acoustic-to-articulatory speech inversion

Ganesh Sivaraman, Vikramjit Mitra, Hosung Nam, Mark Tiede, Carol Espy-Wilson

Research output: Contribution to journal › Conference article

10 Citations (Scopus)

Abstract

Speech inversion is a well-known ill-posed problem and addition of speaker differences typically makes it even harder. This paper investigates a vocal tract length normalization (VTLN) technique to transform the acoustic space of different speakers to a target speaker space such that speaker specific details are minimized. The speaker normalized features are then used to train a feed-forward neural network based acoustic-to-articulatory speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients and the articulatory features are represented by six tract-variable (TV) trajectories. Experiments are performed with ten speakers from the U. Wisc. X-ray microbeam database. Speaker dependent speech inversion systems are trained for each speaker as baselines to compare the performance of the speaker independent approach. For each target speaker, data from the remaining nine speakers are transformed using the proposed approach and the transformed features are used to train a speech inversion system. The performances of the individual systems are compared using the correlation between the estimated and the actual TVs on the target speaker's test set. Results show that the proposed speaker normalization approach provides a 7% absolute improvement in correlation as compared to the system where speaker normalization was not performed.
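The evaluation protocol in the abstract (train on nine speakers, test on the held-out target speaker, score by correlation between estimated and actual TVs) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `tv_correlation` and `leave_one_speaker_out`, the `(frames, n_tvs)` array layout, and the use of the Pearson coefficient are all assumptions made for the example, and the VTLN transform and the neural network themselves are not shown.

```python
import numpy as np


def tv_correlation(estimated, actual):
    """Pearson correlation for each tract variable.

    Both inputs are assumed to have shape (frames, n_tvs); the result has
    shape (n_tvs,), one coefficient per TV trajectory.
    """
    est = np.asarray(estimated, dtype=float)
    act = np.asarray(actual, dtype=float)
    # Centre each trajectory over time before correlating.
    est_c = est - est.mean(axis=0)
    act_c = act - act.mean(axis=0)
    num = (est_c * act_c).sum(axis=0)
    den = np.sqrt((est_c ** 2).sum(axis=0) * (act_c ** 2).sum(axis=0))
    return num / den


def leave_one_speaker_out(speakers):
    """Yield (target, training_speakers) pairs, one per target speaker."""
    for target in speakers:
        yield target, [s for s in speakers if s != target]
```

With ten speaker identifiers, `leave_one_speaker_out` yields ten splits of nine training speakers each, and `tv_correlation` would be applied to each target speaker's test set, giving six values for the six TVs.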

Original language: English
Pages (from-to): 455-459
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 08-12-September-2016
DOIs: 10.21437/Interspeech.2016-1399
Publication status: Published - 2016 Jan 1
Event: 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016 - San Francisco, United States
Duration: 2016 Sep 8 → 2016 Sep 16

Fingerprint

Normalization
Inversion
Acoustics
Target
Feedforward neural networks
Ill-posed Problem
Test Set
Trajectories
Baseline
X rays
Speech
Vocal Tract
Length
Trajectory
Transform
Dependent
Coefficient
Experiments
Experiment

Keywords

  • Acoustic to articulatory speech inversion
  • Speaker normalization
  • Vocal Tract Length Normalization

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Vocal tract length normalization for speaker independent acoustic-to-articulatory speech inversion. / Sivaraman, Ganesh; Mitra, Vikramjit; Nam, Hosung; Tiede, Mark; Espy-Wilson, Carol.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 08-12-September-2016, 01.01.2016, p. 455-459.


@article{ac09c8e42eea4aaca3e0079a791957ff,
title = "Vocal tract length normalization for speaker independent acoustic-to-articulatory speech inversion",
abstract = "Speech inversion is a well-known ill-posed problem and addition of speaker differences typically makes it even harder. This paper investigates a vocal tract length normalization (VTLN) technique to transform the acoustic space of different speakers to a target speaker space such that speaker specific details are minimized. The speaker normalized features are then used to train a feed-forward neural network based acoustic-to-articulatory speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients and the articulatory features are represented by six tract-variable (TV) trajectories. Experiments are performed with ten speakers from the U. Wisc. X-ray microbeam database. Speaker dependent speech inversion systems are trained for each speaker as baselines to compare the performance of the speaker independent approach. For each target speaker, data from the remaining nine speakers are transformed using the proposed approach and the transformed features are used to train a speech inversion system. The performances of the individual systems are compared using the correlation between the estimated and the actual TVs on the target speaker's test set. Results show that the proposed speaker normalization approach provides a 7{\%} absolute improvement in correlation as compared to the system where speaker normalization was not performed.",
keywords = "Acoustic to articulatory speech inversion, Speaker normalization, Vocal Tract Length Normalization",
author = "Ganesh Sivaraman and Vikramjit Mitra and Hosung Nam and Mark Tiede and Carol Espy-Wilson",
year = "2016",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2016-1399",
language = "English",
volume = "08-12-September-2016",
pages = "455--459",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}

TY - JOUR

T1 - Vocal tract length normalization for speaker independent acoustic-to-articulatory speech inversion

AU - Sivaraman, Ganesh

AU - Mitra, Vikramjit

AU - Nam, Hosung

AU - Tiede, Mark

AU - Espy-Wilson, Carol

PY - 2016/1/1

Y1 - 2016/1/1

N2 - Speech inversion is a well-known ill-posed problem and addition of speaker differences typically makes it even harder. This paper investigates a vocal tract length normalization (VTLN) technique to transform the acoustic space of different speakers to a target speaker space such that speaker specific details are minimized. The speaker normalized features are then used to train a feed-forward neural network based acoustic-to-articulatory speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients and the articulatory features are represented by six tract-variable (TV) trajectories. Experiments are performed with ten speakers from the U. Wisc. X-ray microbeam database. Speaker dependent speech inversion systems are trained for each speaker as baselines to compare the performance of the speaker independent approach. For each target speaker, data from the remaining nine speakers are transformed using the proposed approach and the transformed features are used to train a speech inversion system. The performances of the individual systems are compared using the correlation between the estimated and the actual TVs on the target speaker's test set. Results show that the proposed speaker normalization approach provides a 7% absolute improvement in correlation as compared to the system where speaker normalization was not performed.

AB - Speech inversion is a well-known ill-posed problem and addition of speaker differences typically makes it even harder. This paper investigates a vocal tract length normalization (VTLN) technique to transform the acoustic space of different speakers to a target speaker space such that speaker specific details are minimized. The speaker normalized features are then used to train a feed-forward neural network based acoustic-to-articulatory speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients and the articulatory features are represented by six tract-variable (TV) trajectories. Experiments are performed with ten speakers from the U. Wisc. X-ray microbeam database. Speaker dependent speech inversion systems are trained for each speaker as baselines to compare the performance of the speaker independent approach. For each target speaker, data from the remaining nine speakers are transformed using the proposed approach and the transformed features are used to train a speech inversion system. The performances of the individual systems are compared using the correlation between the estimated and the actual TVs on the target speaker's test set. Results show that the proposed speaker normalization approach provides a 7% absolute improvement in correlation as compared to the system where speaker normalization was not performed.

KW - Acoustic to articulatory speech inversion

KW - Speaker normalization

KW - Vocal Tract Length Normalization

UR - http://www.scopus.com/inward/record.url?scp=84994252380&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84994252380&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2016-1399

DO - 10.21437/Interspeech.2016-1399

M3 - Conference article

AN - SCOPUS:84994252380

VL - 08-12-September-2016

SP - 455

EP - 459

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -