| Improving Text-To-Audio Models with Synthetic Captions | Jun 18, 2024 | AudioCaps / Audio captioning | Code Available | 5 |
| AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | Jan 29, 2023 | AudioCaps / Audio Generation | Code Available | 4 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | May 18, 2023 | 1 Image, 2*2 Stitchi / Action Classification | Code Available | 3 |
| Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model | Apr 24, 2023 | AudioCaps / Audio Generation | Code Available | 3 |
| GLAP: General contrastive audio-text pretraining across domains and languages | Jun 12, 2025 | AudioCaps / Keyword Spotting | Code Available | 2 |
| ETTA: Elucidating the Design Space of Text-to-Audio Models | Dec 26, 2024 | AudioCaps / Audio captioning | Code Available | 2 |
| EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | Sep 2, 2024 | AudioCaps / Audio captioning | Code Available | 2 |
| SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation | May 28, 2024 | AudioCaps / Audio Generation | Code Available | 2 |
| EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | Jan 31, 2024 | AudioCaps / Audio captioning | Code Available | 2 |
| ADIFF: Explaining audio difference using natural language | Feb 6, 2025 | AudioCaps / Audio captioning | Code Available | 1 |