SOTAVerified

Zero-shot Text to Audio Retrieval

Papers

Showing 1-6 of 6 papers

Title | Status | Hype
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Code | 7
ImageBind: One Embedding Space To Bind Them All | Code | 5
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Code | 4
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Code | 2
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | Code | 2
Learning Audio-Video Modalities from Image Captions |  | 0

No leaderboard results yet.