SOTAVerified

Audio captioning

Audio captioning is the task of describing audio content in natural language. The general approach pairs an audio encoder (e.g., PANNs, CAV-MAE) with a text decoder (e.g., a Transformer) that generates the caption. Caption quality is usually judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr, and their mean, SPIDEr), though these n-gram-based metrics are not well suited to audio: many different sentences can correctly describe the same sound. More recent work therefore explores metrics based on pretrained language models, such as Sentence-BERT similarity.
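To see why n-gram metrics struggle with audio captions, here is a minimal sketch of BLEU-1-style clipped unigram precision (pure Python, no external dependencies; the example captions are made up for illustration). A correct paraphrase that uses different words scores poorly against the reference:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """BLEU-1-style clipped unigram precision of a candidate caption
    against a single reference caption."""
    cand_words = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    if not cand_words:
        return 0.0
    # Each candidate word can only match as many times as it appears
    # in the reference (clipping).
    matched = sum(min(count, ref_counts[word])
                  for word, count in Counter(cand_words).items())
    return matched / len(cand_words)

reference  = "a dog barks while cars pass by"
literal    = "a dog barks while cars pass by"           # exact wording
paraphrase = "barking from a dog as traffic goes past"  # same sound, new words

print(unigram_precision(literal, reference))     # 1.0
print(unigram_precision(paraphrase, reference))  # 0.25, despite being correct
```

The paraphrase describes the same audio but shares only two of its eight words with the reference, which is the mismatch that motivates embedding-based metrics like Sentence-BERT similarity.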

Papers

Showing 1-10 of 119 papers

Title | Status | Hype
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Code | 2
AC/DC: LLM-based Audio Comprehension via Dialogue Continuation | - | 0
CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer | - | 0
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion | Code | 2
Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning | - | 0
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining | - | 0
M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP | Code | 0
Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context | Code | 0
Mellow: a small audio language model for reasoning | Code | 2
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities | - | 0

Benchmark Results

#  | Model          | Metric | Claimed | Verified | Status
1  | VAST           | CIDEr  | 0.78    | -        | Unverified
2  | VALOR          | CIDEr  | 0.74    | -        | Unverified
3  | MQ-Cap         | SPIDEr | 0.52    | -        | Unverified
4  | SLAM-AAC       | SPIDEr | 0.52    | -        | Unverified
5  | LAVCap         | SPIDEr | 0.52    | -        | Unverified
6  | EnCLAP++-large | SPIDEr | 0.51    | -        | Unverified
7  | AutoCap        | SPIDEr | 0.51    | -        | Unverified
8  | LOAE           | SPIDEr | 0.51    | -        | Unverified
9  | EnCLAP++-base  | SPIDEr | 0.50    | -        | Unverified
10 | EnCLAP-large   | SPIDEr | 0.50    | -        | Unverified
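Most entries above report SPIDEr, which is simply the arithmetic mean of a caption's SPICE and CIDEr scores. A minimal sketch, assuming per-system SPICE and CIDEr values have already been computed (the numbers below are hypothetical, not taken from the table):

```python
def spider(spice: float, cider: float) -> float:
    """SPIDEr is the arithmetic mean of SPICE and CIDEr scores."""
    return (spice + cider) / 2.0

# Hypothetical scores, for illustration only.
print(spider(spice=0.24, cider=0.80))  # 0.52
```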