| Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval | Jun 22, 2024 | AudioCaps, Retrieval | Unverified | 0 |
| Improving Text-To-Audio Models with Synthetic Captions | Jun 18, 2024 | AudioCaps, Audio captioning | Code Available | 5 |
| MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation | Jun 15, 2024 | AudioCaps, Image Generation | Code Available | 0 |
| Bridging Language Gaps in Audio-Text Retrieval | Jun 11, 2024 | AudioCaps, Retrieval | Code Available | 1 |
| SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation | May 28, 2024 | AudioCaps, Audio Generation | Code Available | 2 |
| Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation | May 16, 2024 | AudioCaps, Event Detection | Code Available | 1 |
| Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval | Mar 15, 2024 | AudioCaps, Contrastive Learning | Unverified | 0 |
| Text-to-Audio Generation Synchronized with Videos | Mar 8, 2024 | AudioCaps, Audio Generation | Unverified | 0 |
| EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | Jan 31, 2024 | AudioCaps, Audio captioning | Code Available | 2 |
| CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing | Jan 22, 2024 | AudioCaps, Audio-Visual Synchronization | Unverified | 0 |