A procedure for estimating gestural scores from natural speech

Hosung Nam, Vikramjit Mitra, Mark Tiede, Elliot Saltzman, Louis Goldstein, Carol Espy-Wilson, Mark Hasegawa-Johnson

Research output: Contribution to conference › Paper

7 Citations (Scopus)

Abstract

Speech can be represented as a constellation of constricting events, called gestures, which are defined at distinct vocal tract sites and organized in the form of a gestural score. Gestures and their output trajectories, called tract variables, are available only in synthetic speech, yet they have recently been shown to improve automatic speech recognition (ASR) performance. In this paper we propose an iterative, analysis-by-synthesis, landmark-based time-warping architecture to obtain gestural scores for natural speech. Given an utterance, the Haskins Laboratories Task Dynamics and Application (TADA) model was used to generate its prototype gestural score and the corresponding synthetic acoustic output. An optimal gestural score was then estimated through iterative time-warping processes such that the distance between the original and the TADA-synthesized speech was minimized. We compared the performance of our approach to that of a conventional dynamic time warping procedure using Log-Spectral and Itakura Distance measures. We also performed a word recognition experiment using the gestural annotations to show that the gestural scores are suitable for word recognition.
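The conventional dynamic time warping baseline mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's landmark-based procedure and it does not call TADA; it only shows textbook DTW between two sequences of spectral frames, using a log-spectral distance as the frame-level cost. The function names `log_spectral_distance` and `dtw`, the choice of power spectra as input, and the toy frames are all assumptions made for illustration.

```python
import math


def log_spectral_distance(p, q, eps=1e-12):
    """RMS difference, in dB, between two power spectra with the same bins."""
    if len(p) != len(q):
        raise ValueError("spectra must have the same number of bins")
    diffs = [10.0 * math.log10((a + eps) / (b + eps)) for a, b in zip(p, q)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def dtw(ref, syn, dist=log_spectral_distance):
    """Align two frame sequences; return (total cost, warping path).

    The path is a list of (ref_index, syn_index) pairs from first to last frame.
    """
    n, m = len(ref), len(syn)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of ref[:i] with syn[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], syn[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy example (hypothetical data): the "natural" utterance holds its middle
# frame one step longer than the "synthetic" one.
natural = [[1.0, 1.0], [4.0, 4.0], [4.0, 4.0], [1.0, 1.0]]
synthetic = [[1.0, 1.0], [4.0, 4.0], [1.0, 1.0]]
total, path = dtw(natural, synthetic)
print(total, path)  # 0.0 [(0, 0), (1, 1), (2, 1), (3, 2)]
```

In an analysis-by-synthesis setting, a warping path of this kind indicates how the timing of the synthetic utterance would need to be stretched or compressed to match the natural one; how the paper maps such timing information back onto the gestural score is not specified in the abstract and is not modeled here.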

Original language: English
Pages: 30-33
Number of pages: 4
Publication status: Published - 2010 Dec 1
Externally published: Yes
Event: 11th Annual Conference of the International Speech Communication Association: Spoken Language Processing for All, INTERSPEECH 2010 - Makuhari, Chiba, Japan
Duration: 2010 Sep 26 to 2010 Sep 30

Fingerprint

Gestures
Acoustics
Natural Speech
Word Recognition
Gesture
Annotation
Utterance
Constellation
Vocal Tract
Automatic Speech Recognition
Trajectory
Prototype
Conventional
Experiment
Spectrality
Landmarks

Keywords

  • Articulatory phonology
  • Gestures
  • TADA model
  • Time warping
  • Vocal tract variables
  • X-ray microbeam data

ASJC Scopus subject areas

  • Language and Linguistics
  • Speech and Hearing

Cite this

Nam, H., Mitra, V., Tiede, M., Saltzman, E., Goldstein, L., Espy-Wilson, C., & Hasegawa-Johnson, M. (2010). A procedure for estimating gestural scores from natural speech. 30-33. Paper presented at 11th Annual Conference of the International Speech Communication Association: Spoken Language Processing for All, INTERSPEECH 2010, Makuhari, Chiba, Japan.

@conference{686e3b18298740808caff31695b47a7a,
title = "A procedure for estimating gestural scores from natural speech",
keywords = "Articulatory phonology, Gestures, TADA model, Time warping, Vocal tract variables, X-ray microbeam data",
author = "Hosung Nam and Vikramjit Mitra and Mark Tiede and Elliot Saltzman and Louis Goldstein and Carol Espy-Wilson and Mark Hasegawa-Johnson",
year = "2010",
month = "12",
day = "1",
language = "English",
pages = "30--33",
note = "11th Annual Conference of the International Speech Communication Association: Spoken Language Processing for All, INTERSPEECH 2010 ; Conference date: 26-09-2010 Through 30-09-2010",

}

TY - CONF

T1 - A procedure for estimating gestural scores from natural speech

AU - Nam, Hosung

AU - Mitra, Vikramjit

AU - Tiede, Mark

AU - Saltzman, Elliot

AU - Goldstein, Louis

AU - Espy-Wilson, Carol

AU - Hasegawa-Johnson, Mark

PY - 2010/12/1

Y1 - 2010/12/1

KW - Articulatory phonology

KW - Gestures

KW - TADA model

KW - Time warping

KW - Vocal tract variables

KW - X-ray microbeam data

UR - http://www.scopus.com/inward/record.url?scp=79959846806&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79959846806&partnerID=8YFLogxK

M3 - Paper

SP - 30

EP - 33

ER -