THE SJTU SYSTEM FOR DCASE2021 CHALLENGE TASK 6: AUDIO CAPTIONING BASED ON ENCODER PRE-TRAINING AND REINFORCEMENT LEARNING

2021-07-06 · DCASE Challenge 2021 · Code Available

Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu

Abstract

This report proposes an audio captioning system for Task 6 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge. Our audio captioning system consists of a 10-layer convolutional neural network (CNN) encoder and a single-layer gated recurrent unit (GRU) decoder with temporal attention. In this challenge, there is no restriction on the use of external data or pre-trained models. To better model the concepts in an audio clip, we pre-train the CNN encoder with audio tagging on AudioSet. After standard cross-entropy training, we further fine-tune the model with reinforcement learning to directly optimize the evaluation metric. Experiments show that our proposed system achieves a SPIDEr of 28.6 on the public evaluation split without ensembling.
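The reinforcement learning step can be illustrated with a small sketch. The abstract does not say which algorithm is used, so this example assumes a self-critical policy-gradient objective, a common choice for captioning, in which the reward of a sampled caption is compared against the reward of the greedily decoded caption. The names `scst_loss` and `toy_reward` are hypothetical, and `toy_reward` is only a word-overlap stand-in for the real metric (SPIDEr), not an implementation of it.

```python
import math


def toy_reward(caption: str, reference: str) -> float:
    """Hypothetical stand-in for the evaluation metric.

    Returns the fraction of distinct reference words that also appear
    in the caption. The real system would score captions with SPIDEr.
    """
    ref_words = set(reference.split())
    hyp_words = set(caption.split())
    return len(ref_words & hyp_words) / len(ref_words)


def scst_loss(sample_logprobs, sample_reward, baseline_reward):
    """Self-critical policy-gradient loss for one sampled caption.

    sample_logprobs: log-probabilities the decoder assigned to each
        token of the sampled caption.
    sample_reward: metric score of the sampled caption.
    baseline_reward: metric score of the greedily decoded caption,
        used as the baseline (an assumption of this sketch).
    """
    advantage = sample_reward - baseline_reward
    # Minimizing this raises the likelihood of samples that beat the
    # baseline and lowers the likelihood of samples that fall short.
    return -advantage * sum(sample_logprobs)


if __name__ == "__main__":
    reference = "a dog barks in the distance"
    sampled = "a dog barks loudly"
    greedy = "a man speaks"

    r_sample = toy_reward(sampled, reference)
    r_greedy = toy_reward(greedy, reference)
    # Pretend the decoder gave each sampled token probability 0.5.
    logprobs = [math.log(0.5)] * len(sampled.split())

    print("sample reward:", r_sample)
    print("greedy reward:", r_greedy)
    print("loss:", scst_loss(logprobs, r_sample, r_greedy))
```

In a real training loop the log-probabilities would come from the GRU decoder and the loss would be backpropagated through them; the rewards are treated as constants because the metric itself is not differentiable.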
