TY - GEN
T1 - Fre-GAN: Adversarial Frequency-consistent Audio Synthesis
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
AU - Kim, Ji Hoon
AU - Lee, Sang Hoon
AU - Lee, Ji Hyun
AU - Lee, Seong Whan
N1 - Funding Information:
This work was supported by Institute for Information & communications Technology Planning & evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Department of Artificial Intelligence, Korea University) and the Netmarble AI Center.
Publisher Copyright:
© 2021 ISCA
PY - 2021
Y1 - 2021
N2 - Although recent works on neural vocoders have improved the quality of synthesized audio, there still exists a gap between generated and ground-truth audio in frequency space. This difference leads to spectral artifacts such as hissing noise or reverberation, and thus degrades the sample quality. In this paper, we propose Fre-GAN, which achieves frequency-consistent audio synthesis with highly improved generation quality. Specifically, we first present a resolution-connected generator and resolution-wise discriminators, which help learn various scales of spectral distributions over multiple frequency bands. Additionally, to reproduce high-frequency components accurately, we leverage the discrete wavelet transform in the discriminators. In our experiments, Fre-GAN achieves high-fidelity waveform generation with a gap of only 0.03 MOS compared to ground-truth audio while outperforming standard models in quality.
AB - Although recent works on neural vocoders have improved the quality of synthesized audio, there still exists a gap between generated and ground-truth audio in frequency space. This difference leads to spectral artifacts such as hissing noise or reverberation, and thus degrades the sample quality. In this paper, we propose Fre-GAN, which achieves frequency-consistent audio synthesis with highly improved generation quality. Specifically, we first present a resolution-connected generator and resolution-wise discriminators, which help learn various scales of spectral distributions over multiple frequency bands. Additionally, to reproduce high-frequency components accurately, we leverage the discrete wavelet transform in the discriminators. In our experiments, Fre-GAN achieves high-fidelity waveform generation with a gap of only 0.03 MOS compared to ground-truth audio while outperforming standard models in quality.
KW - Audio synthesis
KW - Discrete wavelet transform
KW - Generative adversarial networks
KW - Neural vocoder
UR - http://www.scopus.com/inward/record.url?scp=85119181453&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-845
DO - 10.21437/Interspeech.2021-845
M3 - Conference contribution
AN - SCOPUS:85119181453
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 3246
EP - 3250
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
Y2 - 30 August 2021 through 3 September 2021
ER -