SOTAVerified

Audio captioning

Audio captioning is the task of describing an audio clip in natural-language text. The usual approach pairs an audio encoder (for example, PANN or CAV-MAE) with a text decoder (for example, a transformer) that generates the caption. Caption quality is commonly scored with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these are not well suited to the task. Metrics based on pretrained language models, such as Sentence-BERT, have also been tried.
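One reason n-gram metrics fit this task poorly can be illustrated with a small sketch. The code below computes only the clipped n-gram precision that BLEU builds on. It is not full BLEU: there is no brevity penalty and no averaging over n-gram orders. The function names and the example captions are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate caption against one reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the clipping step used in BLEU).
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical captions, invented for this illustration.
reference = "a dog barks while cars pass by"
paraphrase = "a canine is barking near traffic"   # same scene, different words
mismatch = "a dog barks while birds sing"         # different scene, shared words

print(clipped_precision(paraphrase, reference))
print(clipped_precision(mismatch, reference))
```

In this toy case the paraphrase shares only one word with the reference, while the caption describing a different sound scene shares four, so the word-overlap score ranks the less accurate caption higher. Embedding-based metrics aim to reduce this kind of mismatch.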

Papers

Showing 101-119 of 119 papers

Title | Status | Hype
AUTOMATED AUDIO CAPTIONING BY FINE-TUNING BART WITH AUDIOSET TAGS | Code | 0
Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning | | 0
Diverse Audio Captioning via Adversarial Training | | 0
Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization | | 0
Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach | Code | 0
THE DCASE 2021 CHALLENGE TASK 6 SYSTEM: AUTOMATED AUDIO CAPTIONING WITH WEAKLY SUPERVISED PRE-TRAING AND WORD SELECTION METHODS | | 0
Audio Captioning with Composition of Acoustic and Semantic Information | | 0
Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval | | 0
Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning | | 0
Multi-task Regularization Based on Infrequent Classes for Audio Captioning | Code | 0
Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning | Code | 0
A Transformer-based Audio Captioning Model with Keyword Estimation | | 0
The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation | | 0
Listen carefully and tell: an audio captioning system based on residual learning and gammatone audio representation | | 0
Audio Captioning using Gated Recurrent Units | | 0
AudioCaps: Generating Captions for Audios in The Wild | | 0
Audio Caption in a Car Setting with a Sentence-Level Loss | Code | 0
An Attempt towards Interpretable Audio-Visual Video Captioning | | 0
Automated Audio Captioning with Recurrent Neural Networks | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | VAST | CIDEr | 0.78 | | Unverified
2 | VALOR | CIDEr | 0.74 | | Unverified
3 | MQ-Cap | SPIDEr | 0.52 | | Unverified
4 | SLAM-AAC | SPIDEr | 0.52 | | Unverified
5 | LAVCap | SPIDEr | 0.52 | | Unverified
6 | EnCLAP++-large | SPIDEr | 0.51 | | Unverified
7 | AutoCap | SPIDEr | 0.51 | | Unverified
8 | LOAE | SPIDEr | 0.51 | | Unverified
9 | EnCLAP++-base | SPIDEr | 0.5 | | Unverified
10 | EnCLAP-large | SPIDEr | 0.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | CIDEr | 0.52 | | Unverified
2 | VALOR | CIDEr | 0.42 | | Unverified
3 | SLAM-AAC | SPIDEr | 0.33 | | Unverified
4 | LOAE | SPIDEr | 0.33 | | Unverified
5 | MQ-Cap | SPIDEr | 0.32 | | Unverified
6 | Ensemble | SPIDEr | 0.32 | | Unverified
7 | Audio Flamingo (Pengi trainset) | SPIDEr | 0.31 | | Unverified
8 | Ensemble-RL | SPIDEr | 0.3 | | Unverified
9 | Qwen-Audio | SPIDEr | 0.29 | | Unverified
10 | Ensemble | SPIDEr | 0.21 | | Unverified