SOTAVerified

Audio captioning

Audio captioning is the task of describing audio content in natural language. The typical approach uses an audio encoder (e.g., PANN or CAV-MAE) to extract audio features and a text decoder (e.g., a transformer) to generate the caption autoregressively. Caption quality is commonly judged with machine-translation metrics (BLEU, METEOR, ROUGE) and image-captioning metrics (SPICE, CIDEr), although these are not well suited to audio; attempts have therefore been made to use metrics based on pretrained language models, such as Sentence-BERT.
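The encode-then-decode pipeline described above can be sketched as follows. Everything here is a toy stand-in: the feature extraction, the scoring rule, and the tiny vocabulary are invented for illustration, not taken from any particular published model; a real system would use PANN or CAV-MAE embeddings and a transformer decoder producing logits.

```python
def encode_audio(waveform):
    """Toy 'encoder': summarise the waveform as a 2-dim feature vector.
    A real encoder (PANN, CAV-MAE) would return a sequence of embeddings."""
    n = len(waveform)
    mean = sum(waveform) / n
    energy = sum(x * x for x in waveform) / n
    return [mean, energy]

def decoder_step(features, prefix):
    """Toy autoregressive 'decoder': scores each vocabulary word given the
    audio features and the tokens generated so far. A transformer decoder
    would produce these scores as logits."""
    vocab = ["a", "dog", "barks", "loudly", "<eos>"]
    # Deterministic dummy scoring, purely for illustration.
    return {word: features[1] - abs(len(prefix) - i)
            for i, word in enumerate(vocab)}

def greedy_caption(waveform, max_len=10):
    """Generate a caption token by token, taking the argmax at each step."""
    features = encode_audio(waveform)
    tokens = []
    for _ in range(max_len):
        scores = decoder_step(features, tokens)
        best = max(scores, key=scores.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return " ".join(tokens)

print(greedy_caption([0.1, -0.2, 0.3, 0.05]))  # → "a dog barks loudly"
```

Real decoders replace the greedy argmax with beam search or sampling, but the loop structure — encode once, then decode one token at a time until an end-of-sequence token — is the same.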

Papers

Showing 51–100 of 119 papers

AC/DC: LLM-based Audio Comprehension via Dialogue Continuation
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
An Attempt towards Interpretable Audio-Visual Video Captioning
An investigation on selecting audio pre-trained models for audio captioning
A Transformer-based Audio Captioning Model with Keyword Estimation
AudioCaps: Generating Captions for Audios in The Wild
Audio Captioning using Gated Recurrent Units
Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval
Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning
Audio Captioning with Composition of Acoustic and Semantic Information
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
Audio Dialogues: Dialogues dataset for audio and music understanding
Audio Difference Learning for Audio Captioning
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Automated Audio Captioning: An Overview of Recent Progress and New Challenges
Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization
Automated Audio Captioning via Fusion of Low- and High- Dimensional Features
Automated Audio Captioning with Epochal Difficult Captions for Curriculum Learning
Automated Audio Captioning with Recurrent Neural Networks
Automatic Audio Captioning using Attention weighted Event based Embeddings
CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer
Classifier-Guided Captioning Across Modalities
CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions
Diverse Audio Captioning via Adversarial Training
Diversity and bias in audio captioning datasets
Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning
Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning
Efficient Audio Captioning Transformer with Patchout and Text Guidance
EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization
Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders
Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models
Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
Generating Realistic Images from In-the-wild Sounds
Impact of visual assistance for automated audio captioning
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
Improving Audio Caption Fluency with Automatic Error Correction
Exploring Train and Test-Time Augmentations for Audio-Language Learning
Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity
THE DCASE 2021 CHALLENGE TASK 6 SYSTEM: AUTOMATED AUDIO CAPTIONING WITH WEAKLY SUPERVISED PRE-TRAING AND WORD SELECTION METHODS
The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation
Towards Diverse and Efficient Audio Captioning via Diffusion Models
Towards Generating Diverse Audio Captions via Adversarial Training
Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
Learning Audio Concepts from Counterfactual Natural Language
DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning

Benchmark Results

#   Model           Metric  Claimed  Verified  Status
1   VAST            CIDEr   0.78     n/a       Unverified
2   VALOR           CIDEr   0.74     n/a       Unverified
3   MQ-Cap          SPIDEr  0.52     n/a       Unverified
4   SLAM-AAC        SPIDEr  0.52     n/a       Unverified
5   LAVCap          SPIDEr  0.52     n/a       Unverified
6   EnCLAP++-large  SPIDEr  0.51     n/a       Unverified
7   AutoCap         SPIDEr  0.51     n/a       Unverified
8   LOAE            SPIDEr  0.51     n/a       Unverified
9   EnCLAP++-base   SPIDEr  0.50     n/a       Unverified
10  EnCLAP-large    SPIDEr  0.50     n/a       Unverified

#   Model                            Metric  Claimed  Verified  Status
1   VAST                             CIDEr   0.52     n/a       Unverified
2   VALOR                            CIDEr   0.42     n/a       Unverified
3   SLAM-AAC                         SPIDEr  0.33     n/a       Unverified
4   LOAE                             SPIDEr  0.33     n/a       Unverified
5   MQ-Cap                           SPIDEr  0.32     n/a       Unverified
6   Ensemble                         SPIDEr  0.32     n/a       Unverified
7   Audio Flamingo (Pengi trainset)  SPIDEr  0.31     n/a       Unverified
8   Ensemble-RL                      SPIDEr  0.30     n/a       Unverified
9   Qwen-Audio                       SPIDEr  0.29     n/a       Unverified
10  Ensemble                         SPIDEr  0.21     n/a       Unverified
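Most entries above report SPIDEr, which is defined as the arithmetic mean of SPICE and CIDEr. The computation is trivial; the component scores in the example below are made up for illustration and do not correspond to any row in the tables:

```python
def spider(spice_score, cider_score):
    """SPIDEr is the arithmetic mean of SPICE and CIDEr."""
    return 0.5 * (spice_score + cider_score)

# Hypothetical component scores: SPICE 0.18, CIDEr 0.84.
print(round(spider(0.18, 0.84), 2))  # → 0.51
```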