SOTAVerified

Audio captioning

Audio captioning is the task of describing audio in text. The general approach is to use an audio encoder (e.g., PANN, CAV-MAE) to encode the audio and a decoder (e.g., a transformer) to generate the text. Caption quality is typically judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), though these are not well suited to audio; attempts have therefore been made to use metrics based on pretrained language models, such as Sentence-BERT.
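To make the n-gram metrics concrete, here is a minimal CIDEr-style scorer. It is a deliberately simplified sketch, not the published metric: it tokenizes on whitespace, computes IDF over just the provided references rather than a corpus, and omits stemming, the length penalty, and the ×10 scaling of real CIDEr. The function name `cider_like` is ours.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, n, idf):
    # Term-frequency of each n-gram, weighted by its IDF.
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values()) or 1
    return {g: (c / total) * idf.get(g, 0.0) for g, c in counts.items()}

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_like(candidate, references, max_n=4):
    # Average, over n-gram orders 1..max_n, of the mean cosine
    # similarity between the candidate's TF-IDF vector and each
    # reference's TF-IDF vector.
    score = 0.0
    for n in range(1, max_n + 1):
        docs = [set(ngrams(r.split(), n)) for r in references]
        vocab = set().union(*docs)
        idf = {g: math.log(len(references) / sum(g in d for d in docs))
               for g in vocab}
        cand_vec = tfidf_vector(candidate.split(), n, idf)
        sims = [cosine(cand_vec, tfidf_vector(r.split(), n, idf))
                for r in references]
        score += sum(sims) / len(sims)
    return score / max_n
```

Note how brittle this is for audio: a candidate that matches one reference verbatim still scores only moderately if the other references use different surface words for the same sound, which is one motivation for the embedding-based metrics (e.g., Sentence-BERT) mentioned above.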

Papers

Showing 101–119 of 119 papers

Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
Improving Audio Caption Fluency with Automatic Error Correction
Exploring Train and Test-Time Augmentations for Audio-Language Learning
Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning
Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics
Joint Speech Recognition and Audio Captioning
Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval?
Leveraging Pre-trained BERT for Audio Captioning
Listen carefully and tell: an audio captioning system based on residual learning and gammatone audio representation
Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning
Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation
Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity
THE DCASE 2021 CHALLENGE TASK 6 SYSTEM: AUTOMATED AUDIO CAPTIONING WITH WEAKLY SUPERVISED PRE-TRAINING AND WORD SELECTION METHODS
The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation
Towards Diverse and Efficient Audio Captioning via Diffusion Models
Towards Generating Diverse Audio Captions via Adversarial Training
Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

Benchmark Results

#  | Model          | Metric | Claimed | Verified | Status
1  | VAST           | CIDEr  | 0.78    | —        | Unverified
2  | VALOR          | CIDEr  | 0.74    | —        | Unverified
3  | MQ-Cap         | SPIDEr | 0.52    | —        | Unverified
4  | SLAM-AAC       | SPIDEr | 0.52    | —        | Unverified
5  | LAVCap         | SPIDEr | 0.52    | —        | Unverified
6  | EnCLAP++-large | SPIDEr | 0.51    | —        | Unverified
7  | AutoCap        | SPIDEr | 0.51    | —        | Unverified
8  | LOAE           | SPIDEr | 0.51    | —        | Unverified
9  | EnCLAP++-base  | SPIDEr | 0.50    | —        | Unverified
10 | EnCLAP-large   | SPIDEr | 0.50    | —        | Unverified
#  | Model                           | Metric | Claimed | Verified | Status
1  | VAST                            | CIDEr  | 0.52    | —        | Unverified
2  | VALOR                           | CIDEr  | 0.42    | —        | Unverified
3  | SLAM-AAC                        | SPIDEr | 0.33    | —        | Unverified
4  | LOAE                            | SPIDEr | 0.33    | —        | Unverified
5  | MQ-Cap                          | SPIDEr | 0.32    | —        | Unverified
6  | Ensemble                        | SPIDEr | 0.32    | —        | Unverified
7  | Audio Flamingo (Pengi trainset) | SPIDEr | 0.31    | —        | Unverified
8  | Ensemble-RL                     | SPIDEr | 0.30    | —        | Unverified
9  | Qwen-Audio                      | SPIDEr | 0.29    | —        | Unverified
10 | Ensemble                        | SPIDEr | 0.21    | —        | Unverified
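Most rows in the benchmark tables report SPIDEr, which is defined as the arithmetic mean of the SPICE and CIDEr scores. A one-line sketch (the component values in the usage example are hypothetical, not taken from the tables):

```python
def spider(spice: float, cider: float) -> float:
    # SPIDEr = (SPICE + CIDEr) / 2: SPICE measures semantic content
    # overlap, CIDEr rewards n-gram consensus with the references.
    return 0.5 * (spice + cider)

# Usage with hypothetical component scores:
score = spider(0.18, 0.82)
```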