SOTAVerified

Audio captioning

Audio captioning is the task of describing audio with natural language. The typical approach pairs an audio encoder (e.g., PANN, CAV-MAE) with a text decoder (e.g., a Transformer) that generates the caption. Caption quality is commonly judged with machine-translation metrics (BLEU, METEOR, ROUGE) and image-captioning metrics (SPICE, CIDEr), but these are not well suited to audio descriptions. Metrics based on pretrained language models, such as Sentence-BERT embedding similarity, have been proposed as alternatives.
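As an illustration of embedding-based caption evaluation, the sketch below scores a candidate caption against a reference by cosine similarity. The `embed` function here is a deliberately simplified bag-of-words stand-in for a pretrained sentence encoder such as Sentence-BERT, and the two captions are invented examples:

```python
import math
from collections import Counter

def embed(caption: str) -> Counter:
    # Stand-in for a dense sentence embedding (e.g., Sentence-BERT):
    # a simple bag-of-words count vector over lowercased tokens.
    return Counter(caption.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

reference = "a dog barks while cars pass by"
candidate = "a dog is barking near passing cars"
score = cosine_similarity(embed(reference), embed(candidate))
```

With a real sentence encoder, the same cosine-similarity step rewards semantically close captions ("barking" vs. "barks") that exact n-gram metrics like BLEU would penalize; this toy version only credits exact token overlap.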

Papers

Showing 51–100 of 119 papers

Title (Status and Hype columns collapsed: status is blank and the hype score is 0 for every paper listed)

Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation
Language-based Audio Retrieval Task in DCASE 2022 Challenge
AC/DC: LLM-based Audio Comprehension via Dialogue Continuation
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
An Attempt towards Interpretable Audio-Visual Video Captioning
An investigation on selecting audio pre-trained models for audio captioning
A Transformer-based Audio Captioning Model with Keyword Estimation
AudioCaps: Generating Captions for Audios in The Wild
Audio Captioning using Gated Recurrent Units
Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval
Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning
Audio Captioning with Composition of Acoustic and Semantic Information
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
Audio Dialogues: Dialogues dataset for audio and music understanding
Audio Difference Learning for Audio Captioning
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Automated Audio Captioning: An Overview of Recent Progress and New Challenges
Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization
Automated Audio Captioning via Fusion of Low- and High- Dimensional Features
Automated Audio Captioning with Epochal Difficult Captions for Curriculum Learning
Automated Audio Captioning with Recurrent Neural Networks
Automatic Audio Captioning using Attention weighted Event based Embeddings
CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer
Classifier-Guided Captioning Across Modalities
CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions
Diverse Audio Captioning via Adversarial Training
Diversity and bias in audio captioning datasets
DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning
Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning
Efficient Audio Captioning Transformer with Patchout and Text Guidance
EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization
Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders
Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models
Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
Generating Realistic Images from In-the-wild Sounds
Impact of visual assistance for automated audio captioning
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
Improving Audio Caption Fluency with Automatic Error Correction
Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer
SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity
THE DCASE 2021 CHALLENGE TASK 6 SYSTEM: AUTOMATED AUDIO CAPTIONING WITH WEAKLY SUPERVISED PRE-TRAING AND WORD SELECTION METHODS
The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation
Towards Diverse and Efficient Audio Captioning via Diffusion Models
Towards Generating Diverse Audio Captions via Adversarial Training
Page 2 of 3

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | VAST | CIDEr | 0.78 | | Unverified
2 | VALOR | CIDEr | 0.74 | | Unverified
3 | MQ-Cap | SPIDEr | 0.52 | | Unverified
4 | SLAM-AAC | SPIDEr | 0.52 | | Unverified
5 | LAVCap | SPIDEr | 0.52 | | Unverified
6 | EnCLAP++-large | SPIDEr | 0.51 | | Unverified
7 | AutoCap | SPIDEr | 0.51 | | Unverified
8 | LOAE | SPIDEr | 0.51 | | Unverified
9 | EnCLAP++-base | SPIDEr | 0.5 | | Unverified
10 | EnCLAP-large | SPIDEr | 0.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | CIDEr | 0.52 | | Unverified
2 | VALOR | CIDEr | 0.42 | | Unverified
3 | SLAM-AAC | SPIDEr | 0.33 | | Unverified
4 | LOAE | SPIDEr | 0.33 | | Unverified
5 | MQ-Cap | SPIDEr | 0.32 | | Unverified
6 | Ensemble | SPIDEr | 0.32 | | Unverified
7 | Audio Flamingo (Pengi trainset) | SPIDEr | 0.31 | | Unverified
8 | Ensemble-RL | SPIDEr | 0.3 | | Unverified
9 | Qwen-Audio | SPIDEr | 0.29 | | Unverified
10 | Ensemble | SPIDEr | 0.21 | | Unverified
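For reference, the SPIDEr metric reported above is standardly defined as the arithmetic mean of a caption's SPICE and CIDEr scores. A minimal sketch; the input scores below are illustrative, not taken from the leaderboard:

```python
def spider(spice: float, cider: float) -> float:
    """SPIDEr: the arithmetic mean of the SPICE and CIDEr scores."""
    return (spice + cider) / 2.0

# Illustrative values only: a system with SPICE 0.18 and CIDEr 0.86
# would report SPIDEr 0.52.
example = spider(0.18, 0.86)
```

Because CIDEr values typically run much higher than SPICE values, a SPIDEr score is usually dominated by the CIDEr term, which is worth keeping in mind when comparing the CIDEr and SPIDEr rows above.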