| AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion | May 28, 2025 | AudioCapsAudio Generation | —Unverified | 0 | 0 |
| Audio-Visual LLM for Video Understanding | Dec 11, 2023 | AudioCapsLanguage Modeling | —Unverified | 0 | 0 |
| Automated Audio Captioning via Fusion of Low- and High- Dimensional Features | Oct 10, 2022 | AudioCapsAudio captioning | —Unverified | 0 | 0 |
| CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing | Jan 22, 2024 | AudioCapsAudio-Visual Synchronization | —Unverified | 0 | 0 |
| DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval | Sep 16, 2024 | AudioCapsRetrieval | —Unverified | 0 | 0 |
| DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment | May 22, 2023 | AudioCapsAudio Generation | —Unverified | 0 | 0 |
| Dissecting Temporal Understanding in Text-to-Audio Retrieval | Sep 1, 2024 | AudioCapsRetrieval | —Unverified | 0 | 0 |
| DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap | Mar 15, 2025 | AudioCapsAudio Generation | —Unverified | 0 | 0 |
| Audio Captioning with Composition of Acoustic and Semantic Information | May 13, 2021 | AudioCapsAudio captioning | —Unverified | 0 | 0 |
| Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning | Oct 14, 2024 | AudioCapsAudio captioning | —Unverified | 0 | 0 |
| AudioCaps: Generating Captions for Audios in The Wild | Jun 1, 2019 | AudioCapsAudio captioning | —Unverified | 0 | 0 |
| FLAP: Fast Language-Audio Pre-training | Nov 2, 2023 | AudioCapsContrastive Learning | —Unverified | 0 | 0 |
| Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval | Jun 22, 2024 | AudioCapsRetrieval | —Unverified | 0 | 0 |
| Generation or Replication: Auscultating Audio Latent Diffusion Models | Oct 16, 2023 | AudioCapsMemorization | —Unverified | 0 | 0 |