TY - JOUR
T1 - Combining multi-task autoencoder with Wasserstein generative adversarial networks for improving speech recognition performance
AU - Kao, Chao Yuan
AU - Ko, Hanseok
N1 - Funding Information:
This research is funded by the Ministry of Environment supported by the Korea Environmental Industry & Technology Institute’s environmental policy-based public technology development project (2017000210001).
Publisher Copyright:
© 2019 Acoustical Society of Korea. All rights reserved.
PY - 2019
Y1 - 2019
N2 - Because background noise in an acoustic signal degrades the performance of speech and acoustic event recognition, extracting noise-robust acoustic features from noisy signals remains challenging. In this paper, we propose a deep learning architecture that combines a Wasserstein Generative Adversarial Network (WGAN) with a MultiTask AutoEncoder (MTAE), integrating the strengths of both so that it estimates not only the noise but also the speech features from a noisy acoustic source. The proposed MTAE-WGAN structure estimates the speech signal and the residual noise by employing a gradient penalty and a weight initialization method for the Leaky Rectified Linear Unit (LReLU) and Parametric ReLU (PReLU). With the adopted gradient penalty loss function, the proposed MTAE-WGAN structure enhances the speech features and subsequently achieves substantial Phoneme Error Rate (PER) improvements over the stand-alone Deep Denoising Autoencoder (DDAE), MTAE, Redundant Convolutional Encoder-Decoder (R-CED), and Recurrent MTAE (RMTAE) models for robust speech recognition.
KW - Deep Neural Network (DNN)
KW - Robust speech recognition
KW - Speech enhancement
KW - Wasserstein Generative Adversarial Network (WGAN)
KW - Weight initialization
UR - http://www.scopus.com/inward/record.url?scp=85079175884&partnerID=8YFLogxK
U2 - 10.7776/ASK.2019.38.6.670
DO - 10.7776/ASK.2019.38.6.670
M3 - Article
AN - SCOPUS:85079175884
SN - 1225-4428
VL - 38
SP - 670
EP - 677
JO - Journal of the Acoustical Society of Korea
JF - Journal of the Acoustical Society of Korea
IS - 6
ER -