SOTAVerified

Audio Generation

Audio generation (synthesis) is the task of generating raw audio such as speech.

(Image credit: MelNet)

Papers

Showing 51-100 of 270 papers

| Title | Status | Hype |
|---|---|---|
| ETTA: Elucidating the Design Space of Text-to-Audio Models | Code | 2 |
| WavMark: Watermarking for Audio Generation | Code | 2 |
| Taming Data and Transformers for Audio Generation | Code | 2 |
| KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation | Code | 2 |
| RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer | Code | 2 |
| Unsupervised Source Separation By Steering Pretrained Music Models | Code | 1 |
| Adversarial Audio Synthesis | Code | 1 |
| V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models | Code | 1 |
| ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation | Code | 1 |
| RiTTA: Modeling Event Relations in Text-to-Audio Generation | Code | 1 |
| BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation | Code | 1 |
| Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization | Code | 1 |
| Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls | Code | 1 |
| WaveNet: A Generative Model for Raw Audio | Code | 1 |
| AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis | Code | 1 |
| Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion | Code | 1 |
| ADIFF: Explaining audio difference using natural language | Code | 1 |
| Any-to-Any Generation via Composable Diffusion | Code | 1 |
| Read, Watch and Scream! Sound Generation from Text and Video | Code | 1 |
| Perceiving Music Quality with GANs | Code | 1 |
| Anytime Sampling for Autoregressive Models via Ordered Autoencoding | Code | 1 |
| Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training | Code | 1 |
| RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses | Code | 1 |
| T-FOLEY: A Controllable Waveform-Domain Diffusion Model for Temporal-Event-Guided Foley Sound Synthesis | Code | 1 |
| Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus | Code | 1 |
| Differentiable Time-Frequency Scattering on GPU | Code | 1 |
| MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions | Code | 1 |
| Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | Code | 1 |
| Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception | Code | 1 |
| MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies | Code | 1 |
| Tell What You Hear From What You See -- Video to Audio Generation Through Text | Code | 1 |
| Localize to Binauralize: Audio Spatialization From Visual Sound Source Localization | Code | 1 |
| LLMBind: A Unified Modality-Task Integration Framework | Code | 1 |
| An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization | Code | 1 |
| LooPy: A Research-Friendly Mix Framework for Music Information Retrieval on Electronic Dance Music | Code | 1 |
| LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation | Code | 1 |
| LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis | Code | 1 |
| Taming Visually Guided Sound Generation | Code | 1 |
| Temporally Aligned Audio for Video with Autoregression | Code | 1 |
| HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement | Code | 1 |
| GACELA -- A generative adversarial context encoder for long audio inpainting | Code | 1 |
| Invisible Watermarking for Audio Generation Diffusion Models | Code | 1 |
| It's Raw! Audio Generation with State-Space Models | Code | 1 |
| From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation | Code | 1 |
| Speech collage: code-switched audio generation by collaging monolingual corpora | Code | 1 |
| Neural Waveshaping Synthesis | Code | 1 |
| Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation | Code | 1 |
| Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization | Code | 1 |
| Audeo: Audio Generation for a Silent Performance Video | Code | 1 |
| Catch-A-Waveform: Learning to Generate Audio from a Single Short Example | Code | 1 |

Benchmark Results

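| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | AudioGen | FD_openl3 | 185.53 | | Unverified |
| 2 | AudioLDM2-large | FD_openl3 | 158.04 | | Unverified |
| 3 | Stable Audio 2.0 | FD_openl3 | 110.62 | | Unverified |
| 4 | Stable Audio | FD_openl3 | 103.66 | | Unverified |
| 5 | ETTA | FD_openl3 | 80.13 | | Unverified |
| 6 | TangoFlux-base | FD_openl3 | 79.7 | | Unverified |
| 7 | Stable Audio Open | FD_openl3 | 78.24 | | Unverified |
| 8 | TangoFlux | FD_openl3 | 75.1 | | Unverified |
| 9 | ETTA-FT-AC-100k | FD_openl3 | 61.79 | | Unverified |
| 10 | Diffsound | FAD | 7.75 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VAB-Encodec (Ours) | Bits per byte | 40 | | Unverified |
| 2 | Sparse Transformer 152M (strided) | Bits per byte | 1.97 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SymphonyNet | Human listening average results | 3.5 | | Unverified |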