| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Mar 22, 2024 | Action Classification, Action Recognition | Code Available | 7 | 5 |
| ImageBind: One Embedding Space To Bind Them All | May 9, 2023 | All, Cross-Modal Retrieval | Code Available | 5 | 5 |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | Oct 3, 2023 | Audio Classification, Contrastive Learning | Code Available | 4 | 5 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning, Audio-Video Question Answering (AVQA) | Code Available | 2 | 5 |
| WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | Mar 30, 2023 | Audio captioning, Event Detection | Code Available | 2 | 5 |
| Learning Audio-Video Modalities from Image Captions | Apr 1, 2022 | Image Captioning, Retrieval | Unverified | 0 | 0 |