Multitask Learning of Deep Neural Network Based Keyword Spotting for IoT Devices

Seong Gyun Leem, In Chul Yoo, Dongsuk Yook

Research output: Contribution to journal › Article

Abstract

Speech-based interfaces are convenient and intuitive, and are therefore strongly preferred for human-computer interaction on Internet of Things (IoT) devices. Pre-defined keywords are typically used as a trigger that tells a device to start accepting the subsequent voice commands. Keyword spotting techniques used as voice trigger mechanisms typically model the target keyword with triphone models and non-keywords with single-state filler models. Recently, deep neural networks (DNN) have outperformed hidden Markov models (HMM) with Gaussian mixture models (GMM) in various tasks, including speech recognition. However, conventional DNN-based keyword spotting methods cannot change the target keywords easily, which is an essential feature for a speech-based IoT device interface. Additionally, the increased computational requirements hinder the use of complex filler models in DNN-based keyword spotting systems, which diminishes the accuracy of such systems. In this paper, we propose a novel DNN-based keyword spotting system that can change the keyword on the fly and uses both triphone and monophone acoustic models in an effort to reduce computational complexity and improve generalization performance. Experimental results on the FFMTIMIT corpus show that the proposed method reduced the error rate by 36.6%.

Original language: English
Journal: IEEE Transactions on Consumer Electronics
DOIs: 10.1109/TCE.2019.2899067
Publication status: Accepted/In press - 2019 Jan 1

Fingerprint

Fillers
Hidden Markov models
Human computer interaction
Speech recognition
Internet of things
Deep neural networks
Computational complexity
Acoustics

Keywords

  • Acoustics
  • Computational modeling
  • Decoding
  • deep neural network
  • Hidden Markov models
  • Internet of Things
  • keyword spotting
  • multitask learning
  • Neural networks
  • Speech recognition

ASJC Scopus subject areas

  • Media Technology
  • Electrical and Electronic Engineering

Cite this

Multitask Learning of Deep Neural Network Based Keyword Spotting for IoT Devices. / Leem, Seong Gyun; Yoo, In Chul; Yook, Dongsuk.

In: IEEE Transactions on Consumer Electronics, 01.01.2019.


@article{eaceaa3e270a45dab2892dd44ce065ea,
title = "Multitask Learning of Deep Neural Network Based Keyword Spotting for IoT Devices",
abstract = "Speech-based interfaces are convenient and intuitive, and therefore, strongly preferred by Internet of Things (IoT) devices for human-computer interaction. Pre-defined keywords are typically used as a trigger to notify devices for inputting the subsequent voice commands. Keyword spotting techniques used as voice trigger mechanisms, typically model the target keyword via triphone models and non-keywords through single-state filler models. Recently, deep neural networks (DNN) have shown better performance compared to hidden Markov models (HMM) with Gaussian mixture models (GMM), in various tasks including speech recognition. However, conventional DNN-based keyword spotting methods cannot change the target keywords easily, which is an essential feature for speech-based IoT device interface. Additionally, the increase in computational requirements interferes with the use of complex filler models in DNN-based keyword spotting systems, which diminishes the accuracy of such systems. In this paper, we propose a novel DNN-based keyword spotting system that alters the keyword on the fly and utilizes triphone and monophone acoustic models in an effort to reduce computational complexity and increase generalization performance. The experimental results using the FFMTIMIT corpus show that the error rate of the proposed method was reduced by 36.6{\%}.",
keywords = "Acoustics, Computational modeling, Decoding, deep neural network, Hidden Markov models, Internet of Things, keyword spotting, multitask learning, Neural networks, Speech recognition",
author = "Leem, {Seong Gyun} and Yoo, {In Chul} and Yook, Dongsuk",
year = "2019",
month = "1",
day = "1",
doi = "10.1109/TCE.2019.2899067",
language = "English",
journal = "IEEE Transactions on Consumer Electronics",
issn = "0098-3063",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Multitask Learning of Deep Neural Network Based Keyword Spotting for IoT Devices

AU - Leem, Seong Gyun

AU - Yoo, In Chul

AU - Yook, Dongsuk

PY - 2019/1/1

Y1 - 2019/1/1

AB - Speech-based interfaces are convenient and intuitive, and therefore, strongly preferred by Internet of Things (IoT) devices for human-computer interaction. Pre-defined keywords are typically used as a trigger to notify devices for inputting the subsequent voice commands. Keyword spotting techniques used as voice trigger mechanisms, typically model the target keyword via triphone models and non-keywords through single-state filler models. Recently, deep neural networks (DNN) have shown better performance compared to hidden Markov models (HMM) with Gaussian mixture models (GMM), in various tasks including speech recognition. However, conventional DNN-based keyword spotting methods cannot change the target keywords easily, which is an essential feature for speech-based IoT device interface. Additionally, the increase in computational requirements interferes with the use of complex filler models in DNN-based keyword spotting systems, which diminishes the accuracy of such systems. In this paper, we propose a novel DNN-based keyword spotting system that alters the keyword on the fly and utilizes triphone and monophone acoustic models in an effort to reduce computational complexity and increase generalization performance. The experimental results using the FFMTIMIT corpus show that the error rate of the proposed method was reduced by 36.6%.

KW - Acoustics

KW - Computational modeling

KW - Decoding

KW - deep neural network

KW - Hidden Markov models

KW - Internet of Things

KW - keyword spotting

KW - multitask learning

KW - Neural networks

KW - Speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85061546748&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85061546748&partnerID=8YFLogxK

U2 - 10.1109/TCE.2019.2899067

DO - 10.1109/TCE.2019.2899067

M3 - Article

AN - SCOPUS:85061546748

JO - IEEE Transactions on Consumer Electronics

JF - IEEE Transactions on Consumer Electronics

SN - 0098-3063

ER -