SOTAVerified

Audio captioning

Audio captioning is the task of describing an audio clip in natural-language text. The typical approach pairs an audio encoder (for example PANN or CAV-MAE) with a decoder (for example a transformer) that generates the caption. Caption quality is usually scored with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these are not well suited to the task. Metrics based on pretrained language models, such as Sentence-BERT, have also been tried.
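As a rough illustration of why n-gram overlap metrics fit captions poorly, the sketch below computes clipped n-gram precision, the building block of BLEU, in plain Python. It is not the full BLEU score (no brevity penalty, no geometric mean over n-gram orders), and the function names and example captions are made up for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate caption against references.

    Each candidate n-gram counts at most as often as it appears in the
    single reference where it is most frequent.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(ngrams(ref.lower().split(), n))
        for gram, count in ref_counts.items():
            max_ref[gram] = max(max_ref[gram], count)
    overlap = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return overlap / sum(cand_counts.values())


# Hypothetical captions for the same clip (illustrative data only).
reference = ["a dog barks while cars pass by"]
literal = "a dog barks while cars pass"
paraphrase = "traffic noise with a canine barking"

print(clipped_precision(literal, reference))     # every unigram matches
print(clipped_precision(paraphrase, reference))  # only "a" matches
```

Both candidate captions describe the same sound scene, yet the paraphrase shares only one word with the reference and scores far lower. Embedding-based metrics such as those built on Sentence-BERT aim to close that gap.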

Papers

Showing 101-119 of 119 papers

Title | Status | Hype
Improving Audio Caption Fluency with Automatic Error Correction | | 0
Exploring Train and Test-Time Augmentations for Audio-Language Learning | | 0
Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning | | 0
Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics | | 0
Joint Speech Recognition and Audio Captioning | | 0
Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval? | | 0
Leveraging Pre-trained BERT for Audio Captioning | | 0
Listen carefully and tell: an audio captioning system based on residual learning and gammatone audio representation | | 0
M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP | | 0
Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning | | 0
Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation | | 0
Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer | | 0
SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | | 0
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining | | 0
Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity | | 0
THE DCASE 2021 CHALLENGE TASK 6 SYSTEM: AUTOMATED AUDIO CAPTIONING WITH WEAKLY SUPERVISED PRE-TRAING AND WORD SELECTION METHODS | | 0
The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation | | 0
Towards Diverse and Efficient Audio Captioning via Diffusion Models | | 0
Towards Generating Diverse Audio Captions via Adversarial Training | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | VAST | CIDEr | 0.78 | | Unverified
2 | VALOR | CIDEr | 0.74 | | Unverified
3 | MQ-Cap | SPIDEr | 0.52 | | Unverified
4 | SLAM-AAC | SPIDEr | 0.52 | | Unverified
5 | LAVCap | SPIDEr | 0.52 | | Unverified
6 | EnCLAP++-large | SPIDEr | 0.51 | | Unverified
7 | AutoCap | SPIDEr | 0.51 | | Unverified
8 | LOAE | SPIDEr | 0.51 | | Unverified
9 | EnCLAP++-base | SPIDEr | 0.5 | | Unverified
10 | EnCLAP-large | SPIDEr | 0.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | CIDEr | 0.52 | | Unverified
2 | VALOR | CIDEr | 0.42 | | Unverified
3 | SLAM-AAC | SPIDEr | 0.33 | | Unverified
4 | LOAE | SPIDEr | 0.33 | | Unverified
5 | MQ-Cap | SPIDEr | 0.32 | | Unverified
6 | Ensemble | SPIDEr | 0.32 | | Unverified
7 | Audio Flamingo (Pengi trainset) | SPIDEr | 0.31 | | Unverified
8 | Ensemble-RL | SPIDEr | 0.3 | | Unverified
9 | Qwen-Audio | SPIDEr | 0.29 | | Unverified
10 | Ensemble | SPIDEr | 0.21 | | Unverified