TY - GEN
T1 - Fre-GAN 2
T2 - 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022
AU - Lee, Sang Hoon
AU - Kim, Ji Hoon
AU - Lee, Kang Eun
AU - Lee, Seong Whan
N1 - Funding Information:
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University)) and Netmarble AI Center.
Publisher Copyright:
© 2022 IEEE
PY - 2022
Y1 - 2022
N2 - Although recent advances in neural vocoders have shown significant improvement, most of these models have a trade-off between audio quality and computational complexity. Since large models are of limited use on low-resource devices, a more efficient neural vocoder should synthesize high-quality audio for practical applicability. In this paper, we present Fre-GAN 2, a fast and efficient high-quality audio synthesis model. For fast synthesis, Fre-GAN 2 only synthesizes low- and high-frequency parts of the audio, and we leverage the inverse discrete wavelet transform to reproduce the target-resolution audio in the generator. Additionally, we introduce adversarial periodic feature distillation, which makes the model synthesize high-quality audio with only a small number of parameters. The experimental results show the superiority of Fre-GAN 2 in audio quality. Furthermore, Fre-GAN 2 achieves a 10.91× generation acceleration, and its parameters are compressed by 21.23× compared with Fre-GAN.
AB - Although recent advances in neural vocoders have shown significant improvement, most of these models have a trade-off between audio quality and computational complexity. Since large models are of limited use on low-resource devices, a more efficient neural vocoder should synthesize high-quality audio for practical applicability. In this paper, we present Fre-GAN 2, a fast and efficient high-quality audio synthesis model. For fast synthesis, Fre-GAN 2 only synthesizes low- and high-frequency parts of the audio, and we leverage the inverse discrete wavelet transform to reproduce the target-resolution audio in the generator. Additionally, we introduce adversarial periodic feature distillation, which makes the model synthesize high-quality audio with only a small number of parameters. The experimental results show the superiority of Fre-GAN 2 in audio quality. Furthermore, Fre-GAN 2 achieves a 10.91× generation acceleration, and its parameters are compressed by 21.23× compared with Fre-GAN.
KW - audio synthesis
KW - generative adversarial networks
KW - neural vocoder
KW - speech synthesis
KW - text-to-speech
UR - http://www.scopus.com/inward/record.url?scp=85131228375&partnerID=8YFLogxK
U2 - 10.1109/ICASSP43922.2022.9746675
DO - 10.1109/ICASSP43922.2022.9746675
M3 - Conference contribution
AN - SCOPUS:85131228375
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6192
EP - 6196
BT - 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 23 May 2022 through 27 May 2022
ER -