| Improving Text-To-Audio Models with Synthetic Captions | Jun 18, 2024 | AudioCaps / Audio captioning | Code Available | 5 |
| AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | Jan 29, 2023 | AudioCaps / Audio Generation | Code Available | 4 |
| Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model | Apr 24, 2023 | AudioCaps / Audio Generation | Code Available | 3 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | May 18, 2023 | 1 Image, 2*2 Stitchi / Action Classification | Code Available | 3 |
| EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | Jan 31, 2024 | AudioCaps / Audio captioning | Code Available | 2 |
| EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | Sep 2, 2024 | AudioCaps / Audio captioning | Code Available | 2 |
| GLAP: General contrastive audio-text pretraining across domains and languages | Jun 12, 2025 | AudioCaps / Keyword Spotting | Code Available | 2 |
| ETTA: Elucidating the Design Space of Text-to-Audio Models | Dec 26, 2024 | AudioCaps / Audio captioning | Code Available | 2 |
| SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation | May 28, 2024 | AudioCaps / Audio Generation | Code Available | 2 |
| Audio Captioning Transformer | Jul 21, 2021 | AudioCaps / Audio captioning | Code Available | 1 |
| ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation | Sep 19, 2023 | AudioCaps / Audio Generation | Code Available | 1 |
| ADIFF: Explaining audio difference using natural language | Feb 6, 2025 | AudioCaps / Audio captioning | Code Available | 1 |
| Audio Retrieval with Natural Language Queries | May 5, 2021 | AudioCaps / Audio to Text Retrieval | Code Available | 1 |
| Audio Retrieval with Natural Language Queries: A Benchmark Study | Dec 17, 2021 | AudioCaps / Audio captioning | Code Available | 1 |
| Audio Retrieval with WavText5K and CLAP Training | Sep 28, 2022 | AudioCaps / Audio captioning | Code Available | 1 |
| Bridging Language Gaps in Audio-Text Retrieval | Jun 11, 2024 | AudioCaps / Retrieval | Code Available | 1 |
| Can Audio Captions Be Evaluated with Image Caption Metrics? | Oct 10, 2021 | AudioCaps / Audio captioning | Code Available | 1 |
| Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates | Nov 14, 2022 | AudioCaps / Audio captioning | Code Available | 1 |
| LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport | Jan 16, 2025 | AudioCaps / Audio captioning | Code Available | 1 |
| On Metric Learning for Audio-Text Cross-Modal Retrieval | Mar 29, 2022 | AudioCaps / Cross-Modal Retrieval | Code Available | 1 |
| Prefix tuning for automated audio captioning | Mar 30, 2023 | AudioCaps / Audio captioning | Code Available | 1 |
| RECAP: Retrieval-Augmented Audio Captioning | Sep 18, 2023 | AudioCaps / Audio captioning | Code Available | 1 |
| Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation | May 16, 2024 | AudioCaps / Event Detection | Code Available | 1 |
| Separate What You Describe: Language-Queried Audio Source Separation | Mar 28, 2022 | AudioCaps / Audio Source Separation | Code Available | 1 |
| Target Sound Extraction with Variable Cross-modality Clues | Mar 15, 2023 | AudioCaps / Target Sound Extraction | Code Available | 1 |
| Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention | Oct 28, 2022 | AudioCaps / Audio captioning | Code Available | 1 |
| Dissecting Temporal Understanding in Text-to-Audio Retrieval | Sep 1, 2024 | AudioCaps / Retrieval | Unverified | 0 |
| Audio Captioning with Composition of Acoustic and Semantic Information | May 13, 2021 | AudioCaps / Audio captioning | Unverified | 0 |
| Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning | Oct 14, 2024 | AudioCaps / Audio captioning | Unverified | 0 |
| AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion | May 28, 2025 | AudioCaps / Audio Generation | Unverified | 0 |
| Audio-Visual LLM for Video Understanding | Dec 11, 2023 | AudioCaps / Language Modeling | Unverified | 0 |
| Automated Audio Captioning via Fusion of Low- and High- Dimensional Features | Oct 10, 2022 | AudioCaps / Audio captioning | Unverified | 0 |
| CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing | Jan 22, 2024 | AudioCaps / Audio-Visual Synchronization | Unverified | 0 |
| DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval | Sep 16, 2024 | AudioCaps / Retrieval | Unverified | 0 |
| DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment | May 22, 2023 | AudioCaps / Audio Generation | Unverified | 0 |
| DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap | Mar 15, 2025 | AudioCaps / Audio Generation | Unverified | 0 |
| AudioCaps: Generating Captions for Audios in The Wild | Jun 1, 2019 | AudioCaps / Audio captioning | Unverified | 0 |
| FLAP: Fast Language-Audio Pre-training | Nov 2, 2023 | AudioCaps / Contrastive Learning | Unverified | 0 |
| Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval | Jun 22, 2024 | AudioCaps / Retrieval | Unverified | 0 |
| Generation or Replication: Auscultating Audio Latent Diffusion Models | Oct 16, 2023 | AudioCaps / Memorization | Unverified | 0 |
| Audiobox: Unified Audio Generation with Natural Language Prompts | Dec 25, 2023 | AudioCaps / Audio Generation | Unverified | 0 |
| IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling | May 31, 2025 | AudioCaps / Audio Generation | Unverified | 0 |
| Joint Speech Recognition and Audio Captioning | Feb 3, 2022 | AudioCaps / Audio captioning | Unverified | 0 |
| Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval? | Aug 29, 2023 | AudioCaps / Audio captioning | Unverified | 0 |
| Language-based Audio Retrieval with Co-Attention Networks | Dec 30, 2024 | AudioCaps / Learning Semantic Representations | Unverified | 0 |
| TAIL: Text-Audio Incremental Learning | Mar 6, 2025 | AudioCaps / Incremental Learning | Unverified | 0 |
| Leveraging Pre-trained BERT for Audio Captioning | Mar 6, 2022 | AudioCaps / Audio captioning | Unverified | 0 |
| Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning | May 28, 2025 | AudioCaps / Audio captioning | Unverified | 0 |
| Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval | Mar 15, 2024 | AudioCaps / Contrastive Learning | Unverified | 0 |
| VoiceLDM: Text-to-Speech with Environmental Context | Sep 24, 2023 | AudioCaps / text-to-speech | Unverified | 0 |