| AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion | May 28, 2025 | AudioCaps / Audio Generation | —Unverified | 0 | 0 |
| Audio-Visual LLM for Video Understanding | Dec 11, 2023 | AudioCaps / Language Modeling | —Unverified | 0 | 0 |
| Automated Audio Captioning via Fusion of Low- and High- Dimensional Features | Oct 10, 2022 | AudioCaps / Audio captioning | —Unverified | 0 | 0 |
| SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Oct 12, 2024 | AudioCaps / Audio captioning | —Unverified | 0 | 0 |
| CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing | Jan 22, 2024 | AudioCaps / Audio-Visual Synchronization | —Unverified | 0 | 0 |
| DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval | Sep 16, 2024 | AudioCaps / Retrieval | —Unverified | 0 | 0 |
| DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment | May 22, 2023 | AudioCaps / Audio Generation | —Unverified | 0 | 0 |
| Dissecting Temporal Understanding in Text-to-Audio Retrieval | Sep 1, 2024 | AudioCaps / Retrieval | —Unverified | 0 | 0 |
| DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap | Mar 15, 2025 | AudioCaps / Audio Generation | —Unverified | 0 | 0 |
| Audio Captioning with Composition of Acoustic and Semantic Information | May 13, 2021 | AudioCaps / Audio captioning | —Unverified | 0 | 0 |