SOTAVerified

Audio captioning

Audio captioning is the task of describing audio content in natural language. The typical approach pairs an audio encoder (e.g., PANN, CAV-MAE) with a text decoder (e.g., a Transformer) that generates the caption. Caption quality is usually judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these n-gram-based metrics are not well suited to the task. Metrics built on pretrained language models, such as Sentence-BERT similarity, have been proposed as alternatives.
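To see why n-gram overlap metrics can undervalue a correct paraphrase, here is a minimal sketch of clipped n-gram precision, the core building block of BLEU. This is an illustration in plain Python, not any library's implementation; the function names and example captions are made up for this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: what fraction of the candidate's
    n-grams also appear in the reference (with counts clipped)."""
    cand_ngrams = ngrams(candidate.split(), n)
    ref_ngrams = ngrams(reference.split(), n)
    if not cand_ngrams:
        return 0.0
    overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())


reference = "a dog barks while cars pass by"
paraphrase = "a dog is barking as traffic passes"

# The paraphrase preserves the meaning, yet only "a" and "dog"
# overlap as unigrams, so precision is just 2/7.
print(ngram_precision(paraphrase, reference, 1))
```

An exact copy of the reference scores 1.0, while a semantically equivalent paraphrase scores near zero — which is exactly the weakness that model-based metrics like Sentence-BERT similarity try to address.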

Papers

Showing 1–50 of 119 papers

| Title | Status | Hype |
|-------|--------|------|
| video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Code | 2 |
| AC/DC: LLM-based Audio Comprehension via Dialogue Continuation | | 0 |
| CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer | | 0 |
| FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion | Code | 2 |
| Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning | | 0 |
| TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining | | 0 |
| M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP | Code | 0 |
| Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context | Code | 0 |
| Mellow: a small audio language model for reasoning | Code | 2 |
| Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities | | 0 |
| Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders | | 0 |
| Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning | | 0 |
| ADIFF: Explaining audio difference using natural language | Code | 1 |
| LLMs can see and hear without any training | Code | 3 |
| CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions | | 0 |
| LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport | Code | 1 |
| Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model | | 0 |
| Classifier-Guided Captioning Across Modalities | | 0 |
| ETTA: Elucidating the Design Space of Text-to-Audio Models | Code | 2 |
| AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models | Code | 2 |
| Tell What You Hear From What You See -- Video to Audio Generation Through Text | Code | 1 |
| EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation | | 0 |
| Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning | | 0 |
| SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Code | 0 |
| DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning | Code | 0 |
| Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | | 0 |
| An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment | Code | 0 |
| OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation | Code | 0 |
| CLAIR-A: Leveraging Large Language Models to Judge Audio Captions | Code | 0 |
| Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models | | 0 |
| Towards Diverse and Efficient Audio Captioning via Diffusion Models | | 0 |
| Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models | | 0 |
| Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning | | 0 |
| EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | Code | 2 |
| Taming Data and Transformers for Audio Generation | Code | 2 |
| Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding | Code | 1 |
| Improving Text-To-Audio Models with Synthetic Captions | Code | 5 |
| Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models | Code | 2 |
| Audio Dialogues: Dialogues dataset for audio and music understanding | | 0 |
| Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | | 0 |
| Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | Code | 5 |
| EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | Code | 2 |
| Learning Audio Concepts from Counterfactual Natural Language | Code | 0 |
| AudioLog: LLMs-Powered Long Audio Logging with Hybrid Token-Semantic Contrastive Learning | Code | 0 |
| Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Code | 3 |
| Zero-shot audio captioning with audio-language model guidance and audio context keywords | Code | 1 |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | Code | 3 |
| LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | Code | 2 |
| Weakly-supervised Automated Audio Captioning via text only training | Code | 0 |
| Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | VAST | CIDEr | 0.78 | | Unverified |
| 2 | VALOR | CIDEr | 0.74 | | Unverified |
| 3 | MQ-Cap | SPIDEr | 0.52 | | Unverified |
| 4 | SLAM-AAC | SPIDEr | 0.52 | | Unverified |
| 5 | LAVCap | SPIDEr | 0.52 | | Unverified |
| 6 | EnCLAP++-large | SPIDEr | 0.51 | | Unverified |
| 7 | AutoCap | SPIDEr | 0.51 | | Unverified |
| 8 | LOAE | SPIDEr | 0.51 | | Unverified |
| 9 | EnCLAP++-base | SPIDEr | 0.50 | | Unverified |
| 10 | EnCLAP-large | SPIDEr | 0.50 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | VAST | CIDEr | 0.52 | | Unverified |
| 2 | VALOR | CIDEr | 0.42 | | Unverified |
| 3 | SLAM-AAC | SPIDEr | 0.33 | | Unverified |
| 4 | LOAE | SPIDEr | 0.33 | | Unverified |
| 5 | MQ-Cap | SPIDEr | 0.32 | | Unverified |
| 6 | Ensemble | SPIDEr | 0.32 | | Unverified |
| 7 | Audio Flamingo (Pengi trainset) | SPIDEr | 0.31 | | Unverified |
| 8 | Ensemble-RL | SPIDEr | 0.30 | | Unverified |
| 9 | Qwen-Audio | SPIDEr | 0.29 | | Unverified |
| 10 | Ensemble | SPIDEr | 0.21 | | Unverified |
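For reference, the SPIDEr scores reported above are conventionally the arithmetic mean of a model's SPICE and CIDEr scores. A trivial sketch, assuming the two component scores have already been computed by a captioning evaluation toolkit (the function name and example values here are illustrative):

```python
def spider(spice_score: float, cider_score: float) -> float:
    """SPIDEr is defined as the mean of SPICE and CIDEr."""
    return (spice_score + cider_score) / 2.0


# Hypothetical component scores for illustration only.
print(spider(0.18, 0.84))  # → approximately 0.51
```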