AMSS-Net: Audio Manipulation on User-Specified Sources with Textual Queries

Woosung Choi, Minseok Kim, Marco A. Martínez Ramírez, Jaehwa Chung, Soonyoung Jung

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This paper proposes a neural network that performs audio transformations on user-specified sources (e.g., vocals) of a given audio track according to a given description while preserving other sources not mentioned in the description. Audio Manipulation on a Specific Source (AMSS) is challenging because a sound object (i.e., a waveform sample or frequency bin) is 'transparent'; it usually carries information from multiple sources, in contrast to a pixel in an image. To address this problem, we propose AMSS-Net, which extracts latent sources and selectively manipulates them while preserving irrelevant sources. We also propose an evaluation benchmark for several AMSS tasks, and we show that AMSS-Net outperforms baselines on several AMSS tasks via objective metrics and empirical verification.
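
The 'transparency' point in the abstract can be illustrated with a toy example. The sketch below is not AMSS-Net and uses no neural network; it only assumes that a track is the sample-wise sum of its sources. The two sine waves standing in for "vocals" and "drums", the sample rate, and all helper names are hypothetical choices made for illustration. It shows that a naive edit applied to the mixture also changes the source that should be preserved.

```python
import math

SAMPLE_RATE = 8000  # Hz; arbitrary value chosen for this toy example


def sine(freq_hz, n_samples, amplitude=1.0):
    """Return samples of a sine wave, used here as a stand-in for one source."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]


def mix(*sources):
    """Sum the sources sample by sample (assumed additive mixing)."""
    return [sum(samples) for samples in zip(*sources)]


def scale(signal, gain):
    """Multiply every sample by a constant gain."""
    return [gain * x for x in signal]


def max_abs_diff(a, b):
    """Largest absolute sample-wise difference between two signals."""
    return max(abs(x - y) for x, y in zip(a, b))


vocals = sine(440.0, 64, amplitude=0.5)  # hypothetical "vocals" source
drums = sine(110.0, 64, amplitude=0.8)   # hypothetical "drums" source
mixture = mix(vocals, drums)             # each sample carries both sources

# Ideal result for a query such as "halve the volume of the vocals":
# only the vocals change, the drums are preserved.
target = mix(scale(vocals, 0.5), drums)

# Naive edit applied directly to the mixture: the drums are halved too.
naive = scale(mixture, 0.5)

# The naive result deviates from the target by exactly half of the drums.
naive_error = max_abs_diff(naive, target)
```

Because every mixture sample already contains both sources, the edit cannot be restricted to one source without first estimating that source in some form, which is the motivation the abstract gives for extracting latent sources.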

Original language: English
Title of host publication: MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 1775-1783
Number of pages: 9
ISBN (Electronic): 9781450386517
Publication status: Published - 2021 Oct 17
Event: 29th ACM International Conference on Multimedia, MM 2021 - Virtual, Online, China
Duration: 2021 Oct 20 to 2021 Oct 24

Publication series

Name: MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia

Conference

Conference: 29th ACM International Conference on Multimedia, MM 2021
Country/Territory: China
City: Virtual, Online
Period: 21/10/20 to 21/10/24

Keywords

  • audio manipulation
  • neural networks
  • text-guided

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Software
  • Computer Graphics and Computer-Aided Design
