Articulatory phonological code for word classification

Xiaodan Zhuang, Hosung Nam, Mark Hasegawa-Johnson, Louis Goldstein, Elliot Saltzman

Research output: Contribution to journal › Conference article

10 Citations (Scopus)

Abstract

We propose a framework that leverages articulatory phonology for speech recognition. "Gestural pattern vectors" (GPVs) encode the instantaneous gestural activations across all tract variables at each point in time. Given a speech observation, recognizing the sequence of GPVs recovers the ensemble of gestural activations, i.e., the gestural score. For each word in the vocabulary, we use a task dynamic model of inter-articulator speech coordination to generate the "canonical" gestural score. Speech recognition is achieved by matching the ensemble of gestural activations. In particular, we estimate the likelihood of the recognized GPV sequence under word-dependent GPV sequence models trained on the "canonical" gestural scores. These likelihoods, weighted by the confidence scores of the recognized GPVs, are used in a Bayesian speech recognizer. Pilot gestural score recovery and word classification experiments are carried out using synthesized data from one speaker. The observation distribution of each GPV is modeled by a tandem model that combines an artificial neural network with Gaussian mixtures. Bigram GPV sequence models are used to distinguish the gestural scores of different words. Given the tract variable time functions, about 80% of the instantaneous gestural activations are correctly recovered. Word recognition accuracy is over 85% for a vocabulary of 139 words with no training observations. These results suggest that the proposed framework might be a viable alternative to the classic sequence-of-phones model.
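The word-classification step described in the abstract (word-dependent bigram GPV sequence models, confidence-weighted likelihoods, a Bayesian decision) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the GPV labels (`g1` to `g4`), the two hypothetical words, the add-k smoothing, the choice to apply each confidence as a multiplier on the log-probability, and the uniform word prior. The tandem neural network and Gaussian mixture observation model that would produce the recognized GPVs and their confidences is not shown.

```python
import math


class BigramModel:
    """Add-k smoothed bigram model over discrete GPV labels (illustrative only)."""

    def __init__(self, vocab, k=0.1):
        self.vocab = sorted(vocab)
        self.k = k            # smoothing constant: an assumption, not from the paper
        self.counts = {}      # counts[prev][cur] = number of observed transitions
        self.totals = {}      # totals[prev] = number of transitions leaving prev

    def train(self, sequence):
        """Accumulate bigram counts from one canonical GPV sequence."""
        padded = ["<s>"] + list(sequence)
        for prev, cur in zip(padded, padded[1:]):
            row = self.counts.setdefault(prev, {})
            row[cur] = row.get(cur, 0.0) + 1.0
            self.totals[prev] = self.totals.get(prev, 0.0) + 1.0

    def log_prob(self, prev, cur):
        """Smoothed log P(cur | prev)."""
        num = self.counts.get(prev, {}).get(cur, 0.0) + self.k
        den = self.totals.get(prev, 0.0) + self.k * len(self.vocab)
        return math.log(num / den)

    def score(self, sequence, confidences=None):
        """Confidence-weighted log-likelihood of a recognized GPV sequence."""
        if confidences is None:
            confidences = [1.0] * len(sequence)
        padded = ["<s>"] + list(sequence)
        return sum(
            c * self.log_prob(prev, cur)
            for c, prev, cur in zip(confidences, padded, padded[1:])
        )


def build_models(canonical, k=0.1):
    """Train one bigram model per word from its canonical GPV sequence."""
    vocab = {g for seq in canonical.values() for g in seq}
    models = {}
    for word, seq in canonical.items():
        model = BigramModel(vocab, k=k)
        model.train(seq)
        models[word] = model
    return models


def classify(recognized, confidences, models, log_priors=None):
    """Return the word maximizing log prior + weighted log-likelihood."""
    log_priors = log_priors or {}
    return max(
        models,
        key=lambda w: log_priors.get(w, 0.0) + models[w].score(recognized, confidences),
    )


# Hypothetical canonical GPV sequences for two made-up words.
canonical = {
    "word_a": ["g1", "g1", "g2", "g2", "g3"],
    "word_b": ["g1", "g3", "g3", "g4", "g4"],
}
models = build_models(canonical)

# A noisy recognized sequence with made-up per-frame confidences.
recognized = ["g1", "g1", "g2", "g3", "g3"]
confidences = [0.9, 0.9, 0.8, 0.6, 0.7]
print(classify(recognized, confidences, models))
```

Because each word model is trained only on a generated canonical sequence, a classifier of this shape needs no recorded training observations per word, which is the property the abstract highlights. Real GPV inventories and sequences would be far larger than this toy example.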

Original language: English
Pages (from-to): 2763-2766
Number of pages: 4
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2009 Nov 27
Externally published: Yes
Event: 10th Annual Conference of the International Speech Communication Association, INTERSPEECH 2009 - Brighton, United Kingdom
Duration: 2009 Sep 6 to 2009 Sep 10

Keywords

  • Artificial neural network
  • Gaussian mixture model
  • Speech gesture
  • Speech production
  • Tandem model

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Sensory Systems

Cite this

Articulatory phonological code for word classification. / Zhuang, Xiaodan; Nam, Hosung; Hasegawa-Johnson, Mark; Goldstein, Louis; Saltzman, Elliot.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 27.11.2009, p. 2763-2766.

@article{0b081abab5974bb2a2696587c074182b,
title = "Articulatory phonological code for word classification",
abstract = "We propose a framework that leverages articulatory phonology for speech recognition. {"}Gestural pattern vectors{"} (GPV) encode the instantaneous gestural activations that exist across all tract variables at each time. Given a speech observation, recognizing the sequence of GPV recovers the ensemble of gestural activations, i.e., the gestural score. For each word in the vocabulary, we use a task dynamic model of inter-articulator speech coordination to generate the {"}canonical{"} gestural score. Speech recognition is achieved by matching the ensemble of gestural activations. In particular, we estimate the likelihood of the recognized GPV sequence on word-dependent GPV sequence models trained using the {"}canonical{"} gestural scores. These likelihoods, weighted by confidence score of the recognized GPVs, are used in a Bayesian speech recognizer. Pilot gestural score recovery and word classification experiments are carried out using synthesized data from one speaker. The observation distribution of each GPV is modeled by an artificial neural network and Gaussian mixture tandem model. Bigram GPV sequence models are used to distinguish gestural scores of different words. Given the tract variable time functions, about 80{\%} of the instantaneous gestural activation is correctly recovered. Word recognition accuracy is over 85{\%} for a vocabulary of 139 words with no training observations. These results suggest that the proposed framework might be a viable alternative to the classic sequence-of-phones model.",
keywords = "Artificial neural network, Gaussian mixture model, Speech gesture, Speech production, Tandem model",
author = "Xiaodan Zhuang and Hosung Nam and Mark Hasegawa-Johnson and Louis Goldstein and Elliot Saltzman",
year = "2009",
month = "11",
day = "27",
language = "English",
pages = "2763--2766",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}

TY - JOUR

T1 - Articulatory phonological code for word classification

AU - Zhuang, Xiaodan

AU - Nam, Hosung

AU - Hasegawa-Johnson, Mark

AU - Goldstein, Louis

AU - Saltzman, Elliot

PY - 2009/11/27

Y1 - 2009/11/27

AB - We propose a framework that leverages articulatory phonology for speech recognition. "Gestural pattern vectors" (GPV) encode the instantaneous gestural activations that exist across all tract variables at each time. Given a speech observation, recognizing the sequence of GPV recovers the ensemble of gestural activations, i.e., the gestural score. For each word in the vocabulary, we use a task dynamic model of inter-articulator speech coordination to generate the "canonical" gestural score. Speech recognition is achieved by matching the ensemble of gestural activations. In particular, we estimate the likelihood of the recognized GPV sequence on word-dependent GPV sequence models trained using the "canonical" gestural scores. These likelihoods, weighted by confidence score of the recognized GPVs, are used in a Bayesian speech recognizer. Pilot gestural score recovery and word classification experiments are carried out using synthesized data from one speaker. The observation distribution of each GPV is modeled by an artificial neural network and Gaussian mixture tandem model. Bigram GPV sequence models are used to distinguish gestural scores of different words. Given the tract variable time functions, about 80% of the instantaneous gestural activation is correctly recovered. Word recognition accuracy is over 85% for a vocabulary of 139 words with no training observations. These results suggest that the proposed framework might be a viable alternative to the classic sequence-of-phones model.

KW - Artificial neural network

KW - Gaussian mixture model

KW - Speech gesture

KW - Speech production

KW - Tandem model

UR - http://www.scopus.com/inward/record.url?scp=70450174439&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70450174439&partnerID=8YFLogxK

M3 - Conference article

SP - 2763

EP - 2766

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -