THE DCASE 2021 CHALLENGE TASK 6 SYSTEM: AUTOMATED AUDIO CAPTIONING WITH WEAKLY SUPERVISED PRE-TRAINING AND WORD SELECTION METHODS

2021-07-06 · DCASE Workshop 2021

Weiqiang Yuan∗, Qichen Han∗, Dong Liu, Xiang Li, Zhen Yang

Abstract

This technical report describes our system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge, Task 6: automated audio captioning. We use an encoder-decoder modeling framework for audio understanding and caption generation. Our solution focuses on two problems in automated audio captioning: data insufficiency and word selection indeterminacy. Because few audio clips with golden captions are available, we collect a large-scale weakly labeled dataset from the Web using heuristic methods. We then pre-train the encoder-decoder models on this dataset and fine-tune them on the Clotho dataset. To address word selection indeterminacy, we use keywords extracted from the captions of similar audio clips, together with audio event tags produced by pre-trained models, to guide word generation at the decoding stage. We tested our submissions on the development-testing dataset. Our best submission achieved a SPIDEr score of 31.8, compared with 5.4 for the baseline system.
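The abstract does not say how keywords and event tags steer the decoder. One common way to realize this kind of guidance is to add a bonus to the scores of keyword tokens before the next word is chosen. The sketch below is a toy illustration under that assumption and is not the authors' implementation; the function names, the five-word vocabulary, the logit values, and the `bonus` parameter are all hypothetical.

```python
import math


def boost_keyword_logits(logits, vocab, keywords, bonus=2.0):
    """Return a copy of `logits` with `bonus` added to each keyword token.

    Assumption: guidance is modeled as an additive logit bias. The report's
    abstract does not specify the actual mechanism.
    """
    keyword_set = set(keywords)
    return [
        score + bonus if token in keyword_set else score
        for token, score in zip(vocab, logits)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_word(logits, vocab, keywords, bonus=2.0):
    """Greedy selection of the next word after keyword biasing."""
    probs = softmax(boost_keyword_logits(logits, vocab, keywords, bonus))
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]


# Hypothetical decoder scores for one decoding step.
vocab = ["a", "dog", "bird", "barks", "sings"]
logits = [0.1, 1.0, 1.5, 0.2, 0.3]

# Hypothetical keywords, e.g. taken from captions of similar audio clips.
keywords = ["dog", "barks"]

unguided_word, _ = pick_next_word(logits, vocab, keywords=[])
guided_word, _ = pick_next_word(logits, vocab, keywords)
print(unguided_word, guided_word)
```

With these made-up scores the unguided step picks "bird", while the keyword bias shifts the choice to "dog". In a real system the same bias would be applied at every step of beam search, and the bonus would need tuning so that keywords inform the caption without being forced into it.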

Tasks

Audio captioning, Caption Generation, Decoder
