SOTAVerified

Audio captioning

Audio captioning is the task of describing audio in natural language. The typical approach pairs an audio encoder (e.g. PANN, CAV-MAE) with a text decoder (e.g. a transformer) that generates the caption. Caption quality is commonly judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these metrics are not well suited to audio; metrics based on pretrained language models, such as Sentence-BERT, have been proposed as alternatives.
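The encoder-decoder approach above can be sketched as a greedy autoregressive loop. This is a minimal illustration, not any published model's API: `toy_encode` and `toy_next_token` are hypothetical stand-ins for a real audio encoder (such as PANN) and a transformer decoder step.

```python
# Minimal sketch of the audio-captioning encoder-decoder loop.
# toy_encode and toy_next_token are hypothetical placeholders for a
# real audio encoder (e.g. PANN) and a transformer decoder step.

END = "<eos>"

def toy_encode(waveform):
    """Stand-in encoder: summarize the waveform as one feature value."""
    return sum(abs(x) for x in waveform) / max(len(waveform), 1)

def toy_next_token(audio_feature, generated):
    """Stand-in decoder step: emit a canned caption, then end."""
    caption = ["a", "dog", "barks"] if audio_feature > 0.5 else ["silence"]
    return caption[len(generated)] if len(generated) < len(caption) else END

def generate_caption(waveform, max_len=20):
    """Greedy autoregressive decoding conditioned on the audio encoding."""
    feature = toy_encode(waveform)
    tokens = []
    while len(tokens) < max_len:
        token = toy_next_token(feature, tokens)
        if token == END:
            break
        tokens.append(token)
    return " ".join(tokens)

print(generate_caption([0.9, 0.8, 0.7]))  # a dog barks
```

In a real system the decoder step would attend over the full sequence of encoder features rather than a scalar summary, and decoding would typically use beam search rather than greedy selection.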

Papers

Showing 1–50 of 119 papers

| Title | Status | Hype |
|---|---|---|
| Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | Code | 5 |
| Improving Text-To-Audio Models with Synthetic Captions | Code | 5 |
| LLMs can see and hear without any training | Code | 3 |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | Code | 3 |
| Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Code | 3 |
| Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models | Code | 2 |
| WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | Code | 2 |
| Taming Data and Transformers for Audio Generation | Code | 2 |
| EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | Code | 2 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code | 2 |
| AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models | Code | 2 |
| video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Code | 2 |
| Mellow: a small audio language model for reasoning | Code | 2 |
| LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | Code | 2 |
| FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion | Code | 2 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Code | 2 |
| EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | Code | 2 |
| Pengi: An Audio Language Model for Audio Tasks | Code | 2 |
| ETTA: Elucidating the Design Space of Text-to-Audio Models | Code | 2 |
| The SJTU System for DCASE2021 Challenge Task 6: Audio Captioning Based on Encoder Pre-training and Reinforcement Learning | Code | 1 |
| ADIFF: Explaining audio difference using natural language | Code | 1 |
| Zero-shot audio captioning with audio-language model guidance and audio context keywords | Code | 1 |
| An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning | Code | 1 |
| Audio Retrieval with WavText5K and CLAP Training | Code | 1 |
| Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention | Code | 1 |
| WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information | Code | 1 |
| Training Audio Captioning Models without Audio | Code | 1 |
| Audio Captioning Transformer | Code | 1 |
| Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding | Code | 1 |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | Code | 1 |
| Clotho: An Audio Captioning Dataset | Code | 1 |
| RECAP: Retrieval-Augmented Audio Captioning | Code | 1 |
| Tell What You Hear From What You See -- Video to Audio Generation Through Text | Code | 1 |
| Multimodal Knowledge Alignment with Reinforcement Learning | Code | 1 |
| CL4AC: A Contrastive Loss for Audio Captioning | Code | 1 |
| Prefix tuning for automated audio captioning | Code | 1 |
| Can Audio Captions Be Evaluated with Image Caption Metrics? | Code | 1 |
| MusCaps: Generating Captions for Music Audio | Code | 1 |
| A Whisper transformer for audio captioning trained with synthetic captions and transfer learning | Code | 1 |
| Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates | Code | 1 |
| LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport | Code | 1 |
| Automated Audio Captioning with Epochal Difficult Captions for Curriculum Learning | — | 0 |
| Automated Audio Captioning via Fusion of Low- and High-Dimensional Features | — | 0 |
| Audio Captioning with Composition of Acoustic and Semantic Information | — | 0 |
| Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization | — | 0 |
| Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning | — | 0 |
| Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models | — | 0 |
| Automated Audio Captioning: An Overview of Recent Progress and New Challenges | — | 0 |
| Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval | — | 0 |
| Audio Captioning using Gated Recurrent Units | — | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VAST | CIDEr | 0.78 | — | Unverified |
| 2 | VALOR | CIDEr | 0.74 | — | Unverified |
| 3 | MQ-Cap | SPIDEr | 0.52 | — | Unverified |
| 4 | SLAM-AAC | SPIDEr | 0.52 | — | Unverified |
| 5 | LAVCap | SPIDEr | 0.52 | — | Unverified |
| 6 | EnCLAP++-large | SPIDEr | 0.51 | — | Unverified |
| 7 | AutoCap | SPIDEr | 0.51 | — | Unverified |
| 8 | LOAE | SPIDEr | 0.51 | — | Unverified |
| 9 | EnCLAP++-base | SPIDEr | 0.50 | — | Unverified |
| 10 | EnCLAP-large | SPIDEr | 0.50 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VAST | CIDEr | 0.52 | — | Unverified |
| 2 | VALOR | CIDEr | 0.42 | — | Unverified |
| 3 | SLAM-AAC | SPIDEr | 0.33 | — | Unverified |
| 4 | LOAE | SPIDEr | 0.33 | — | Unverified |
| 5 | MQ-Cap | SPIDEr | 0.32 | — | Unverified |
| 6 | Ensemble | SPIDEr | 0.32 | — | Unverified |
| 7 | Audio Flamingo (Pengi trainset) | SPIDEr | 0.31 | — | Unverified |
| 8 | Ensemble-RL | SPIDEr | 0.30 | — | Unverified |
| 9 | Qwen-Audio | SPIDEr | 0.29 | — | Unverified |
| 10 | Ensemble | SPIDEr | 0.21 | — | Unverified |