THE SJTU SYSTEM FOR DCASE2021 CHALLENGE TASK 6: AUDIO CAPTIONING BASED ON ENCODER PRE-TRAINING AND REINFORCEMENT LEARNING
Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu
- github.com/wsntxxn/AudioCaption (PyTorch)
Abstract
This report proposes an audio captioning system for Task 6 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge. The system consists of a 10-layer convolutional neural network (CNN) encoder and a single-layer gated recurrent unit (GRU) decoder with temporal attention. The challenge places no restriction on the use of external data or pre-trained models. To better model the concepts in an audio clip, we pre-train the CNN encoder on AudioSet with an audio tagging objective. After standard cross-entropy training, we further fine-tune the model with reinforcement learning to directly optimize the evaluation metric. Experiments show that the proposed system achieves a SPIDEr of 28.6 on the public evaluation split without ensembling.
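The reinforcement learning stage can be illustrated with a minimal sketch. The abstract does not specify the exact algorithm, so the code below assumes a plain REINFORCE-style objective with a scalar baseline; the function names `spider` and `policy_gradient_loss` and all numbers are hypothetical and are not taken from the authors' repository. The only fixed fact used is that SPIDEr is the mean of CIDEr and SPICE.

```python
import math


def spider(cider: float, spice: float) -> float:
    """SPIDEr is the arithmetic mean of the CIDEr and SPICE scores."""
    return 0.5 * (cider + spice)


def policy_gradient_loss(token_log_probs, reward, baseline=0.0):
    """REINFORCE-style loss for one sampled caption (illustrative assumption).

    token_log_probs: log-probabilities the decoder assigned to each sampled token.
    reward: metric score of the sampled caption.
    baseline: score subtracted from the reward to reduce gradient variance
              (assumed here; for example, the score of a greedy decode).
    """
    advantage = reward - baseline
    # Minimizing this loss raises the likelihood of captions that beat the baseline.
    return -advantage * sum(token_log_probs)


# Toy example with made-up values: a two-token caption.
log_probs = [math.log(0.5), math.log(0.25)]
reward = spider(cider=0.6, spice=0.2)
loss = policy_gradient_loss(log_probs, reward, baseline=0.3)
```

In a real system the log-probabilities would come from the GRU decoder and the reward from scoring the sampled caption against the reference captions, with gradients flowing only through the log-probability term.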