| FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation | Jul 11, 2025 | Audio Generation, Data Augmentation | Unverified | 0 |
| ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing | Jun 26, 2025 | Audio Generation, Large Language Model | Code Available | 5 |
| Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance | Jun 26, 2025 | Audio Generation, Audio Synthesis | Unverified | 0 |
| Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation | Jun 24, 2025 | Audio Generation, Audio-Visual Synchronization | Unverified | 0 |
| ViSAGe: Video-to-Spatial Audio Generation | Jun 13, 2025 | Audio Generation | Unverified | 0 |
| LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation | Jun 13, 2025 | Audio Generation | Unverified | 0 |
| BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation | Jun 11, 2025 | Audio Generation, FAD | Code Available | 1 |
| A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations | Jun 6, 2025 | Audio Generation, Text Generation | Unverified | 0 |
| Sounding that Object: Interactive Object-Aware Image to Audio Generation | Jun 4, 2025 | Audio Generation, Image Segmentation | Unverified | 0 |
| InfiniteAudio: Infinite-Length Audio Generation with Consistency | Jun 3, 2025 | Audio Generation, Denoising | Unverified | 0 |
| DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization | Jun 3, 2025 | Audio Generation, Audio Source Separation | Unverified | 0 |
| XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark | May 31, 2025 | Audio Generation, Face Swapping | Code Available | 0 |
| IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling | May 31, 2025 | AudioCaps, Audio Generation | Unverified | 0 |
| AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion | May 28, 2025 | AudioCaps, Audio Generation | Unverified | 0 |
| Conditional Diffusion Models with Classifier-Free Gibbs-like Guidance | May 27, 2025 | Audio Generation, Denoising | Code Available | 0 |
| EnvSDD: Benchmarking Environmental Sound Deepfake Detection | May 25, 2025 | Audio Deepfake Detection, Audio Generation | Unverified | 0 |
| MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation | May 23, 2025 | Audio Generation, Benchmarking | Unverified | 0 |
| Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio | May 19, 2025 | Audio Generation, Information Retrieval | Unverified | 0 |
| DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis | May 14, 2025 | Audio Generation, Audio Synthesis | Unverified | 0 |
| Fast Text-to-Audio Generation with Adversarial Post-Training | May 13, 2025 | ARC, Audio Generation | Code Available | 7 |
| TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining | May 12, 2025 | Audio captioning, Audio Generation | Unverified | 0 |
| Discrete Optimal Transport and Voice Conversion | May 7, 2025 | Audio Generation, Voice Conversion | Unverified | 0 |
| Wasserstein Convergence of Score-based Generative Models under Semiconvexity and Discontinuous Gradients | May 6, 2025 | Audio Generation, Denoising | Unverified | 0 |
| OmniAudio: Generating Spatial Audio from 360-Degree Video | Apr 21, 2025 | Audio Generation | Code Available | 3 |
| On the Design of Diffusion-based Neural Speech Codecs | Apr 11, 2025 | Audio Generation, Image Generation | Unverified | 0 |
| Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception | Apr 9, 2025 | All, Audio Deepfake Detection | Code Available | 1 |
| Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models | Apr 6, 2025 | Audio Generation, GPU | Unverified | 0 |
| Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization | Mar 28, 2025 | Audio Generation, FAD | Code Available | 1 |
| DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos | Mar 28, 2025 | Audio Generation, Large Language Model | Unverified | 0 |
| DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation | Mar 28, 2025 | Audio Generation, Audio-Visual Synchronization | Unverified | 0 |
| Make Some Noise: Towards LLM audio reasoning and generation using sound tokens | Mar 28, 2025 | Audio Generation, Quantization | Unverified | 0 |
| DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap | Mar 15, 2025 | AudioCaps, Audio Generation | Unverified | 0 |
| AudioX: Diffusion Transformer for Anything-to-Audio Generation | Mar 13, 2025 | Audio Generation, Music Generation | Unverified | 0 |
| TA-V2A: Textually Assisted Video-to-Audio Generation | Mar 12, 2025 | Audio Generation | Unverified | 0 |
| Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition | Mar 10, 2025 | Audio Generation, Quantization | Unverified | 0 |
| ReelWave: Multi-Agentic Movie Sound Generation through Multimodal LLM Conversation | Mar 10, 2025 | Audio Generation | Unverified | 0 |
| Speech Audio Generation from dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder | Mar 9, 2025 | Audio Generation, Denoising | Unverified | 0 |
| PodAgent: A Comprehensive Framework for Podcast Generation | Mar 1, 2025 | Audio Generation, Speech Synthesis | Code Available | 2 |
| InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation | Feb 28, 2025 | Audio Generation, Form | Code Available | 5 |
| DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model | Feb 26, 2025 | Audio Generation, Large Language Model | Unverified | 0 |
| KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation | Feb 21, 2025 | Audio Generation, FAD | Code Available | 2 |
| Towards efficient quantum algorithms for diffusion probability models | Feb 20, 2025 | Audio Generation | Unverified | 0 |
| AudioSpa: Spatializing Sound Events with Text | Feb 16, 2025 | Audio Generation, Data Augmentation | Unverified | 0 |
| TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument | Feb 13, 2025 | Audio Generation, Decoder | Code Available | 2 |
| Latent Swap Joint Diffusion for 2D Long-Form Latent Generation | Feb 7, 2025 | Audio Generation, Denoising | Code Available | 4 |
| ADIFF: Explaining audio difference using natural language | Feb 6, 2025 | AudioCaps, Audio captioning | Code Available | 1 |
| UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation | Feb 6, 2025 | Audio Generation, Diversity | Unverified | 0 |
| AudioGenX: Explainability on Text-to-Audio Generative Models | Feb 1, 2025 | Audio Generation, counterfactual | Code Available | 0 |
| CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions | Jan 28, 2025 | Audio captioning, Audio Generation | Unverified | 0 |
| Baichuan-Omni-1.5 Technical Report | Jan 26, 2025 | Audio Generation | Code Available | 2 |