| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Mar 22, 2024 | Action Classification; Action Recognition | Code Available | 7 | 5 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | May 18, 2023 | 1 Image, 2*2 Stitchi; Action Classification | Code Available | 3 | 5 |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioning; Audio-Visual Captioning | Code Available | 2 | 5 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning; Audio-Video Question Answering (AVQA) | Code Available | 2 | 5 |
| Audio Retrieval with Natural Language Queries | May 5, 2021 | AudioCaps; Audio to Text Retrieval | Code Available | 1 | 5 |
| Cross Modal Retrieval with Querybank Normalisation | Dec 23, 2021 | Cross-Modal Retrieval; Metric Learning | Code Available | 1 | 5 |
| The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation | Nov 16, 2023 | Music Captioning; Music Generation | Code Available | 1 | 5 |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | Dec 17, 2021 | AudioCaps; Audio captioning | Code Available | 1 | 5 |
| Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets | Aug 8, 2023 | Retrieval; Text to Audio Retrieval | Code Available | 0 | 5 |
| Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval | Aug 21, 2024 | AudioCaps; Contrastive Learning | Code Available | 0 | 5 |