| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | May 29, 2023 | Audio captioning; Audio-Visual Captioning | Code Available | 2 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | May 18, 2023 | 1 Image, 2*2 Stitchi; Action Classification | Code Available | 3 |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning; Audio-Video Question Answering (AVQA) | Code Available | 2 |
| Data leakage in cross-modal retrieval training: A case study | Feb 23, 2023 | Cross-Modal Retrieval; Retrieval | Unverified | 0 |
| Exploring Train and Test-Time Augmentations for Audio-Language Learning | Oct 31, 2022 | Audio captioning; Audio to Text Retrieval | Unverified | 0 |
| Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval | Oct 6, 2022 | Metric Learning; Retrieval | Unverified | 0 |
| Cross Modal Retrieval with Querybank Normalisation | Dec 23, 2021 | Cross-Modal Retrieval; Metric Learning | Code Available | 1 |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | Dec 17, 2021 | AudioCaps; Audio captioning | Code Available | 1 |
| OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation | Jul 1, 2021 | Audio to Text Retrieval; Cross-Modal Retrieval | Code Available | 0 |
| Audio Retrieval with Natural Language Queries | May 5, 2021 | AudioCaps; Audio to Text Retrieval | Code Available | 1 |