Audio dequantization for high fidelity audio generation in flow-based neural vocoder

Hyun Wook Yoon, Sang Hoon Lee, Hyeong Rae Noh, Seong Whan Lee

Research output: Contribution to journal › Conference article › peer-review

Abstract

In recent work, flow-based neural vocoders have shown significant improvement in real-time speech generation tasks. A sequence of invertible flow operations allows the model to convert samples from a simple distribution into audio samples. However, training a continuous density model on discrete audio data can degrade model performance because of the topological difference between the latent and the actual distributions. To resolve this problem, we propose audio dequantization methods for flow-based neural vocoders to achieve high-fidelity audio generation. Data dequantization is a well-known method in image generation but has not yet been studied in the audio domain. For this reason, we implement various audio dequantization methods in a flow-based neural vocoder and investigate their effect on the generated audio. We conduct various objective performance assessments and a subjective evaluation to show that audio dequantization can improve audio generation quality. In our experiments, audio dequantization produces waveform audio with better harmonic structure and fewer digital artifacts.
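The abstract does not say which dequantization methods were implemented. As a rough illustration only, the sketch below shows uniform dequantization, the simplest variant known from image generation, applied to integer PCM samples. The function names `uniform_dequantize` and `requantize`, the 16-bit depth, and the scaling to [-1, 1) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def uniform_dequantize(pcm, bits=16, rng=None):
    """Turn signed integer PCM samples into continuous values in [-1, 1).

    Hypothetical helper: adds noise u ~ U[0, 1) to every sample so that each
    discrete level is spread over a unit-width bin, then rescales.
    """
    rng = np.random.default_rng() if rng is None else rng
    pcm = np.asarray(pcm)
    half = 2 ** (bits - 1)  # 32768 for 16-bit audio
    # Each integer k becomes a point in the interval [k, k + 1).
    noisy = pcm.astype(np.float64) + rng.uniform(0.0, 1.0, size=pcm.shape)
    # Rescale from [-half, half) to [-1, 1).
    return noisy / half


def requantize(x, bits=16):
    """Map continuous values back to their integer bins (inverse up to the noise)."""
    half = 2 ** (bits - 1)
    return np.floor(np.asarray(x) * half).astype(np.int64)


if __name__ == "__main__":
    samples = np.array([-32768, -1, 0, 1, 32767], dtype=np.int16)
    continuous = uniform_dequantize(samples, rng=np.random.default_rng(0))
    print(continuous)              # continuous values a density model can be trained on
    print(requantize(continuous))  # integer bins recovered by flooring
```

Because the noise stays inside each sample's own bin, flooring recovers the original integer levels, so the continuous model is trained on data whose support no longer collapses onto isolated points.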

Original language: English
Pages (from-to): 3545-3549
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2020-October
Publication status: Published - 2020
Event: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China
Duration: 2020 Oct 25 → 2020 Oct 29

Keywords

  • Audio synthesis
  • Data dequantization
  • Deep learning
  • Flow-based generative models
  • Neural vocoder

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
