Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion

Ganesh Sivaraman, Vikramjit Mitra, Hosung Nam, Mark Tiede, Carol Espy-Wilson

Research output: Contribution to journal › Article

Abstract

Speech inversion is a well-known ill-posed problem and addition of speaker differences typically makes it even harder. Normalizing the speaker differences is essential to effectively using multi-speaker articulatory data for training a speaker independent speech inversion system. This paper explores a vocal tract length normalization (VTLN) technique to transform the acoustic features of different speakers to a target speaker acoustic space such that speaker specific details are minimized. The speaker normalized features are then used to train a deep feed-forward neural network based speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients. The articulatory features are represented by six tract-variable (TV) trajectories, which are relatively speaker invariant compared to flesh point data. Experiments are performed with ten speakers from the University of Wisconsin X-ray microbeam database. Results show that the proposed speaker normalization approach provides an 8.15% relative improvement in correlation between actual and estimated TVs as compared to the system where speaker normalization was not performed. To determine the efficacy of the method across datasets, cross speaker evaluations were performed across speakers from the Multichannel Articulatory-TIMIT and EMA-IEEE datasets. Results prove that the VTLN approach provides improvement in performance even across datasets.
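A rough, illustrative sketch of the mapping the abstract describes is given below: time-contextualized MFCC frames go into a small feed-forward network that predicts the six TV trajectories, and the output is scored by per-TV correlation. This is not the authors' implementation; the context width, layer sizes, activation, and training settings are assumptions, and random arrays stand in for the VTLN-normalized MFCCs and measured TVs (the VTLN warping step itself is not shown).

import numpy as np
import torch
import torch.nn as nn

N_MFCC = 13      # cepstral coefficients per frame (assumed)
CONTEXT = 8      # +/- frames of temporal context (assumed)
N_TVS = 6        # six tract variables, as in the paper
IN_DIM = N_MFCC * (2 * CONTEXT + 1)

class SpeechInversionDNN(nn.Module):
    """Feed-forward network mapping contextualized MFCCs to TV trajectories."""
    def __init__(self, in_dim=IN_DIM, hidden=512, n_tvs=N_TVS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_tvs),
        )

    def forward(self, x):
        return self.net(x)

def add_context(mfcc, context=CONTEXT):
    """Stack +/- `context` neighbouring frames onto each frame of a (T, N_MFCC)
    MFCC matrix, padding edges by repetition; returns (T, N_MFCC * (2*context + 1))."""
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode="edge")
    frames = [padded[i:i + len(mfcc)] for i in range(2 * context + 1)]
    return np.concatenate(frames, axis=1)

# Toy data standing in for VTLN-normalized MFCCs and measured TVs of one utterance.
T = 200
mfcc = np.random.randn(T, N_MFCC).astype(np.float32)
tvs = np.random.randn(T, N_TVS).astype(np.float32)

x = torch.from_numpy(add_context(mfcc))
y = torch.from_numpy(tvs)

model = SpeechInversionDNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(200):              # a short training loop on the toy data
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# Correlation between estimated and actual TVs, the evaluation metric in the abstract.
with torch.no_grad():
    est = model(x).numpy()
corr = [np.corrcoef(est[:, k], tvs[:, k])[0, 1] for k in range(N_TVS)]
print("per-TV correlation:", np.round(corr, 3))

In the paper's setting the same correlation would instead be computed on held-out speakers, once with and once without VTLN-based normalization, to obtain the relative improvement reported in the abstract.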

Original language: English
Pages (from-to): 316-329
Number of pages: 14
Journal: Journal of the Acoustical Society of America
Volume: 146
Issue number: 1
DOIs: https://doi.org/10.1121/1.5116130
Publication status: Published - 2019 Jul 1

ASJC Scopus subject areas

  • Arts and Humanities (miscellaneous)
  • Acoustics and Ultrasonics

Cite this

Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion. / Sivaraman, Ganesh; Mitra, Vikramjit; Nam, Hosung; Tiede, Mark; Espy-Wilson, Carol.

In: Journal of the Acoustical Society of America, Vol. 146, No. 1, 01.07.2019, p. 316-329.

Research output: Contribution to journal › Article

Sivaraman, Ganesh ; Mitra, Vikramjit ; Nam, Hosung ; Tiede, Mark ; Espy-Wilson, Carol. / Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion. In: Journal of the Acoustical Society of America. 2019 ; Vol. 146, No. 1. pp. 316-329.
@article{b69d9cda9ab342ebb25cf8459402d6cc,
title = "Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion",
abstract = "Speech inversion is a well-known ill-posed problem and addition of speaker differences typically makes it even harder. Normalizing the speaker differences is essential to effectively using multi-speaker articulatory data for training a speaker independent speech inversion system. This paper explores a vocal tract length normalization (VTLN) technique to transform the acoustic features of different speakers to a target speaker acoustic space such that speaker specific details are minimized. The speaker normalized features are then used to train a deep feed-forward neural network based speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients. The articulatory features are represented by six tract-variable (TV) trajectories, which are relatively speaker invariant compared to flesh point data. Experiments are performed with ten speakers from the University of Wisconsin X-ray microbeam database. Results show that the proposed speaker normalization approach provides an 8.15{\%} relative improvement in correlation between actual and estimated TVs as compared to the system where speaker normalization was not performed. To determine the efficacy of the method across datasets, cross speaker evaluations were performed across speakers from the Multichannel Articulatory-TIMIT and EMA-IEEE datasets. Results prove that the VTLN approach provides improvement in performance even across datasets.",
author = "Ganesh Sivaraman and Vikramjit Mitra and Hosung Nam and Mark Tiede and Carol Espy-Wilson",
year = "2019",
month = "7",
day = "1",
doi = "10.1121/1.5116130",
language = "English",
volume = "146",
pages = "316--329",
journal = "Journal of the Acoustical Society of America",
issn = "0001-4966",
publisher = "Acoustical Society of America",
number = "1",
}

TY - JOUR

T1 - Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion

AU - Sivaraman, Ganesh

AU - Mitra, Vikramjit

AU - Nam, Hosung

AU - Tiede, Mark

AU - Espy-Wilson, Carol

PY - 2019/7/1

Y1 - 2019/7/1

N2 - Speech inversion is a well-known ill-posed problem and addition of speaker differences typically makes it even harder. Normalizing the speaker differences is essential to effectively using multi-speaker articulatory data for training a speaker independent speech inversion system. This paper explores a vocal tract length normalization (VTLN) technique to transform the acoustic features of different speakers to a target speaker acoustic space such that speaker specific details are minimized. The speaker normalized features are then used to train a deep feed-forward neural network based speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients. The articulatory features are represented by six tract-variable (TV) trajectories, which are relatively speaker invariant compared to flesh point data. Experiments are performed with ten speakers from the University of Wisconsin X-ray microbeam database. Results show that the proposed speaker normalization approach provides an 8.15% relative improvement in correlation between actual and estimated TVs as compared to the system where speaker normalization was not performed. To determine the efficacy of the method across datasets, cross speaker evaluations were performed across speakers from the Multichannel Articulatory-TIMIT and EMA-IEEE datasets. Results prove that the VTLN approach provides improvement in performance even across datasets.

AB - Speech inversion is a well-known ill-posed problem and addition of speaker differences typically makes it even harder. Normalizing the speaker differences is essential to effectively using multi-speaker articulatory data for training a speaker independent speech inversion system. This paper explores a vocal tract length normalization (VTLN) technique to transform the acoustic features of different speakers to a target speaker acoustic space such that speaker specific details are minimized. The speaker normalized features are then used to train a deep feed-forward neural network based speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients. The articulatory features are represented by six tract-variable (TV) trajectories, which are relatively speaker invariant compared to flesh point data. Experiments are performed with ten speakers from the University of Wisconsin X-ray microbeam database. Results show that the proposed speaker normalization approach provides an 8.15% relative improvement in correlation between actual and estimated TVs as compared to the system where speaker normalization was not performed. To determine the efficacy of the method across datasets, cross speaker evaluations were performed across speakers from the Multichannel Articulatory-TIMIT and EMA-IEEE datasets. Results prove that the VTLN approach provides improvement in performance even across datasets.

UR - http://www.scopus.com/inward/record.url?scp=85069781363&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85069781363&partnerID=8YFLogxK

U2 - 10.1121/1.5116130

DO - 10.1121/1.5116130

M3 - Article

AN - SCOPUS:85069781363

VL - 146

SP - 316

EP - 329

JO - Journal of the Acoustical Society of America

JF - Journal of the Acoustical Society of America

SN - 0001-4966

IS - 1

ER -