| Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention | Oct 28, 2022 | AudioCaps, Audio captioning | Code Available | 1 |
| Dissecting Temporal Understanding in Text-to-Audio Retrieval | Sep 1, 2024 | AudioCaps, Retrieval | Unverified | 0 |
| Audio Captioning with Composition of Acoustic and Semantic Information | May 13, 2021 | AudioCaps, Audio captioning | Unverified | 0 |
| Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning | Oct 14, 2024 | AudioCaps, Audio captioning | Unverified | 0 |
| AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion | May 28, 2025 | AudioCaps, Audio Generation | Unverified | 0 |
| Audio-Visual LLM for Video Understanding | Dec 11, 2023 | AudioCaps, Language Modeling | Unverified | 0 |
| Automated Audio Captioning via Fusion of Low- and High- Dimensional Features | Oct 10, 2022 | AudioCaps, Audio captioning | Unverified | 0 |
| CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing | Jan 22, 2024 | AudioCaps, Audio-Visual Synchronization | Unverified | 0 |
| DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval | Sep 16, 2024 | AudioCaps, Retrieval | Unverified | 0 |
| DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment | May 22, 2023 | AudioCaps, Audio Generation | Unverified | 0 |
| DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap | Mar 15, 2025 | AudioCaps, Audio Generation | Unverified | 0 |
| AudioCaps: Generating Captions for Audios in The Wild | Jun 1, 2019 | AudioCaps, Audio captioning | Unverified | 0 |
| FLAP: Fast Language-Audio Pre-training | Nov 2, 2023 | AudioCaps, Contrastive Learning | Unverified | 0 |
| Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval | Jun 22, 2024 | AudioCaps, Retrieval | Unverified | 0 |
| Generation or Replication: Auscultating Audio Latent Diffusion Models | Oct 16, 2023 | AudioCaps, Memorization | Unverified | 0 |
| Audiobox: Unified Audio Generation with Natural Language Prompts | Dec 25, 2023 | AudioCaps, Audio Generation | Unverified | 0 |
| IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling | May 31, 2025 | AudioCaps, Audio Generation | Unverified | 0 |
| Joint Speech Recognition and Audio Captioning | Feb 3, 2022 | AudioCaps, Audio captioning | Unverified | 0 |
| Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval? | Aug 29, 2023 | AudioCaps, Audio captioning | Unverified | 0 |
| Language-based Audio Retrieval with Co-Attention Networks | Dec 30, 2024 | AudioCaps, Learning Semantic Representations | Unverified | 0 |
| TAIL: Text-Audio Incremental Learning | Mar 6, 2025 | AudioCaps, Incremental Learning | Unverified | 0 |
| Leveraging Pre-trained BERT for Audio Captioning | Mar 6, 2022 | AudioCaps, Audio captioning | Unverified | 0 |
| Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning | May 28, 2025 | AudioCaps, Audio captioning | Unverified | 0 |
| Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval | Mar 15, 2024 | AudioCaps, Contrastive Learning | Unverified | 0 |
| VoiceLDM: Text-to-Speech with Environmental Context | Sep 24, 2023 | AudioCaps, text-to-speech | Unverified | 0 |