| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | May 18, 2023 | 1 Image, 2*2 Stitchi; Action Classification | Code Available | 3 |
| AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models | Nov 28, 2024 | Audio captioning; Audio to Text Retrieval | Code Available | 2 |
| Contrastive Audio-Language Learning for Music | Aug 25, 2022 | Audio to Text Retrieval; Descriptive | Code Available | 1 |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | Dec 17, 2021 | AudioCaps; Audio captioning | Code Available | 1 |
| Audio Retrieval with Natural Language Queries | May 5, 2021 | AudioCaps; Audio to Text Retrieval | Code Available | 1 |
| M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP | Mar 28, 2025 | Audio captioning; Audio Classification | Unverified | 0 |
| Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval? | Aug 29, 2023 | AudioCaps; Audio captioning | Unverified | 0 |
| On Negative Sampling for Contrastive Audio-Text Retrieval | Nov 8, 2022 | Audio to Text Retrieval; Contrastive Learning | Unverified | 0 |
| Exploring Train and Test-Time Augmentations for Audio-Language Learning | Oct 31, 2022 | Audio captioning; Audio to Text Retrieval | Unverified | 0 |
| OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation | Jul 1, 2021 | Audio to Text Retrieval; Cross-Modal Retrieval | Code Available | 0 |