SOTAVerified

Audio captioning

Audio captioning is the task of describing audio content in text. The usual approach pairs an audio encoder (for example PANN or CAV-MAE) with a text decoder (for example a transformer) that generates the caption. Caption quality is commonly scored with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these are not well suited to the task. Metrics based on pretrained language models, such as Sentence-BERT, have also been tried.
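One way to see why n-gram overlap metrics fit this task poorly is a minimal sketch of clipped n-gram precision, the building block of BLEU. This is not the full BLEU metric (no brevity penalty, no averaging over n-gram orders), and the function names and example captions are hypothetical, chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is counted at most as often as it appears in
    the reference (the "clipping" used by BLEU). Tokenization is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical captions for a single audio clip.
reference = "a dog barks while cars pass by"
paraphrase = "a canine is barking near traffic"      # same meaning, new words
wrong_event = "a dog barks while birds pass by"      # wrong sound source

paraphrase_score = clipped_precision(paraphrase, reference, 2)
wrong_event_score = clipped_precision(wrong_event, reference, 2)

print(f"paraphrase bigram precision:  {paraphrase_score:.2f}")
print(f"wrong-event bigram precision: {wrong_event_score:.2f}")
```

In this toy case the faithful paraphrase shares no bigrams with the reference, while the caption that names the wrong sound source shares four of its six. Surface overlap rewards wording rather than meaning, which is the gap that embedding-based metrics such as those built on Sentence-BERT try to close.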

Papers

Showing 26-50 of 119 papers

| Title | Status | Hype |
|---|---|---|
| An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning | Code | 1 |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | Code | 1 |
| RECAP: Retrieval-Augmented Audio Captioning | Code | 1 |
| Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention | Code | 1 |
| WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information | Code | 1 |
| Clotho: An Audio Captioning Dataset | Code | 1 |
| Zero-shot audio captioning with audio-language model guidance and audio context keywords | Code | 1 |
| MusCaps: Generating Captions for Music Audio | Code | 1 |
| CL4AC: A Contrastive Loss for Audio Captioning | Code | 1 |
| LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport | Code | 1 |
| THE SJTU SYSTEM FOR DCASE2021 CHALLENGE TASK 6: AUDIO CAPTIONING BASED ON ENCODER PRE-TRAINING AND REINFORCEMENT LEARNING | Code | 1 |
| Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates | Code | 1 |
| Can Audio Captions Be Evaluated with Image Caption Metrics? | Code | 1 |
| A Whisper transformer for audio captioning trained with synthetic captions and transfer learning | Code | 1 |
| Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding | Code | 1 |
| Training Audio Captioning Models without Audio | Code | 1 |
| SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Code | 0 |
| Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context | Code | 0 |
| AUTOMATED AUDIO CAPTIONING BY FINE-TUNING BART WITH AUDIOSET TAGS | Code | 0 |
| An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment | Code | 0 |
| Automated Audio Captioning and Language-Based Audio Retrieval | Code | 0 |
| OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation | Code | 0 |
| DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning | Code | 0 |
| AudioLog: LLMs-Powered Long Audio Logging with Hybrid Token-Semantic Contrastive Learning | Code | 0 |
| Language-based Audio Retrieval Task in DCASE 2022 Challenge | Code | 0 |

Benchmark Results

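| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VAST | CIDEr | 0.78 | | Unverified |
| 2 | VALOR | CIDEr | 0.74 | | Unverified |
| 3 | MQ-Cap | SPIDEr | 0.52 | | Unverified |
| 4 | SLAM-AAC | SPIDEr | 0.52 | | Unverified |
| 5 | LAVCap | SPIDEr | 0.52 | | Unverified |
| 6 | EnCLAP++-large | SPIDEr | 0.51 | | Unverified |
| 7 | AutoCap | SPIDEr | 0.51 | | Unverified |
| 8 | LOAE | SPIDEr | 0.51 | | Unverified |
| 9 | EnCLAP++-base | SPIDEr | 0.5 | | Unverified |
| 10 | EnCLAP-large | SPIDEr | 0.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VAST | CIDEr | 0.52 | | Unverified |
| 2 | VALOR | CIDEr | 0.42 | | Unverified |
| 3 | SLAM-AAC | SPIDEr | 0.33 | | Unverified |
| 4 | LOAE | SPIDEr | 0.33 | | Unverified |
| 5 | MQ-Cap | SPIDEr | 0.32 | | Unverified |
| 6 | Ensemble | SPIDEr | 0.32 | | Unverified |
| 7 | Audio Flamingo (Pengi trainset) | SPIDEr | 0.31 | | Unverified |
| 8 | Ensemble-RL | SPIDEr | 0.3 | | Unverified |
| 9 | Qwen-Audio | SPIDEr | 0.29 | | Unverified |
| 10 | Ensemble | SPIDEr | 0.21 | | Unverified |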