| SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Oct 12, 2024 | AudioCaps, Audio captioning | — Unverified | 0 |
| CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing | Jan 22, 2024 | AudioCaps, Audio-Visual Synchronization | — Unverified | 0 |
| DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval | Sep 16, 2024 | AudioCaps, Retrieval | — Unverified | 0 |
| DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment | May 22, 2023 | AudioCaps, Audio Generation | — Unverified | 0 |
| DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap | Mar 15, 2025 | AudioCaps, Audio Generation | — Unverified | 0 |
| AudioCaps: Generating Captions for Audios in The Wild | Jun 1, 2019 | AudioCaps, Audio captioning | — Unverified | 0 |
| FLAP: Fast Language-Audio Pre-training | Nov 2, 2023 | AudioCaps, Contrastive Learning | — Unverified | 0 |
| Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval | Jun 22, 2024 | AudioCaps, Retrieval | — Unverified | 0 |
| Generation or Replication: Auscultating Audio Latent Diffusion Models | Oct 16, 2023 | AudioCaps, Memorization | — Unverified | 0 |
| Audiobox: Unified Audio Generation with Natural Language Prompts | Dec 25, 2023 | AudioCaps, Audio Generation | — Unverified | 0 |