THE SJTU SYSTEM FOR DCASE2021 CHALLENGE TASK 6: AUDIO CAPTIONING BASED ON ENCODER PRE-TRAINING AND REINFORCEMENT LEARNING

2021-07-06DCASE Challenge 2021Code Available1· sign in to hype

Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu

Code Available — Be the first to reproduce this paper.

Code

github.com/wsntxxn/AudioCaption
pytorch★ 51

Abstract

This report proposes an audio captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge task Task 6. Our audio captioning system consists of a 10-layer convolution neural network (CNN) encoder and a tempo- ral attentional single layer gated recurrent unit (GRU) decoder. In this challenge, there is no restriction on the usage of external data and pre-trained models. To better model the concepts in an audio clip, we pre-train the CNN encoder with audio tagging on AudioSet. After standard cross entropy based training, we further fine-tune the model with reinforcement learning to directly optimize the evalua- tion metric. Experiments show that our proposed system achieves a SPIDEr of 28.6 on the public evaluation split without ensemble1.

Tasks

Audio captioning Audio Tagging Decoder reinforcement-learning Reinforcement Learning (RL)

THE SJTU SYSTEM FOR DCASE2021 CHALLENGE TASK 6: AUDIO CAPTIONING BASED ON ENCODER PRE-TRAINING AND REINFORCEMENT LEARNING

Code

Abstract

Tasks

Reproductions