SOTAVerified

Audio captioning

Audio captioning is the task of describing audio content in text. The general approach is to encode the audio with an audio encoder (for example, PANN or CAV-MAE) and to generate the text with a decoder (for example, a transformer). Caption quality is commonly judged with machine translation and summarization metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these are not well suited to the task. Attempts have been made to use metrics based on pretrained language models, such as Sentence-BERT.
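One reason n-gram overlap metrics fit this task poorly can be sketched with a simplified, BLEU-style clipped n-gram precision. This is an illustrative approximation only, not the full BLEU metric or any benchmark's scoring code; the function names and the three captions are made up for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """Simplified BLEU-style modified n-gram precision against one reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Credit each candidate n-gram at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical captions, invented for illustration.
reference = "a dog barks while cars pass by"
paraphrase = "a canine is barking near traffic"  # same scene, different words
lookalike = "a dog barks while birds pass by"    # similar words, different scene

paraphrase_score = clipped_precision(paraphrase, reference)
lookalike_score = clipped_precision(lookalike, reference)
print(f"paraphrase: {paraphrase_score:.2f}, lookalike: {lookalike_score:.2f}")
```

In this toy case the faithful paraphrase shares only one word with the reference, while the caption that names a different sound source shares almost all of them. Word overlap alone therefore ranks them the wrong way round, which is the kind of mismatch that embedding-based metrics such as Sentence-BERT similarity aim to reduce.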

Papers

Showing 101–119 of 119 papers

Title | Status | Hype
Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning | Code | 0
Automated Audio Captioning and Language-Based Audio Retrieval | Code | 0
AudioLog: LLMs-Powered Long Audio Logging with Hybrid Token-Semantic Contrastive Learning | Code | 0
Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement | Code | 0
Language-based Audio Retrieval Task in DCASE 2022 Challenge | Code | 0
Weakly-supervised Automated Audio Captioning via text only training | Code | 0
Audio Caption in a Car Setting with a Sentence-Level Loss | Code | 0
Crowdsourcing and Evaluating Text-Based Audio Retrieval Relevances | Code | 0
An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment | Code | 0
Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach | Code | 0
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions | Code | 0
OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation | Code | 0
Multi-task Regularization Based on Infrequent Classes for Audio Captioning | Code | 0
M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP | Code | 0
Caption Feature Space Regularization for Audio Captioning | Code | 0
Local Information Assisted Attention-free Decoder for Audio Captioning | Code | 0
SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Code | 0
Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context | Code | 0
AUTOMATED AUDIO CAPTIONING BY FINE-TUNING BART WITH AUDIOSET TAGS | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | VAST | CIDEr | 0.78 |  | Unverified
2 | VALOR | CIDEr | 0.74 |  | Unverified
3 | MQ-Cap | SPIDEr | 0.52 |  | Unverified
4 | SLAM-AAC | SPIDEr | 0.52 |  | Unverified
5 | LAVCap | SPIDEr | 0.52 |  | Unverified
6 | EnCLAP++-large | SPIDEr | 0.51 |  | Unverified
7 | AutoCap | SPIDEr | 0.51 |  | Unverified
8 | LOAE | SPIDEr | 0.51 |  | Unverified
9 | EnCLAP++-base | SPIDEr | 0.5 |  | Unverified
10 | EnCLAP-large | SPIDEr | 0.5 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | CIDEr | 0.52 |  | Unverified
2 | VALOR | CIDEr | 0.42 |  | Unverified
3 | SLAM-AAC | SPIDEr | 0.33 |  | Unverified
4 | LOAE | SPIDEr | 0.33 |  | Unverified
5 | MQ-Cap | SPIDEr | 0.32 |  | Unverified
6 | Ensemble | SPIDEr | 0.32 |  | Unverified
7 | Audio Flamingo (Pengi trainset) | SPIDEr | 0.31 |  | Unverified
8 | Ensemble-RL | SPIDEr | 0.3 |  | Unverified
9 | Qwen-Audio | SPIDEr | 0.29 |  | Unverified
10 | Ensemble | SPIDEr | 0.21 |  | Unverified