| Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens | Mar 3, 2025 | Attribute, text-to-speech | Code Available | 11 |
| IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System | Feb 8, 2025 | Decoder, Language Modeling | Code Available | 11 |
| F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching | Oct 9, 2024 | Denoising, text-to-speech | Code Available | 11 |
| CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens | Jul 7, 2024 | Language Modelling, Large Language Model | Code Available | 11 |
| Metis: A Foundation Speech Generation Model with Masked Generative Pre-training | Feb 5, 2025 | Self-Supervised Learning, Speech Enhancement | Code Available | 9 |
| Overview of the Amphion Toolkit (v0.2) | Jan 26, 2025 | text-to-speech, Text to Speech | Code Available | 9 |
| Moshi: a speech-text foundation model for real-time dialogue | Sep 17, 2024 | Action Detection, Activity Detection | Code Available | 9 |
| MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer | Sep 1, 2024 | Self-Supervised Learning, text-to-speech | Code Available | 9 |
| VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild | Mar 25, 2024 | Decoder, Language Modeling | Code Available | 9 |
| Natural language guidance of high-fidelity text-to-speech with synthetic annotations | Feb 2, 2024 | In-Context Learning, Language Modeling | Code Available | 9 |
| Speechless: Speech Instruction Training Without Speech for Low Resource Languages | May 23, 2025 | speech-recognition, Speech Recognition | Code Available | 7 |
| GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | Dec 3, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 7 |
| Seed-TTS: A Family of High-Quality Versatile Speech Generation Models | Jun 4, 2024 | In-Context Learning, Language Modelling | Code Available | 7 |
| Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | Jan 5, 2023 | In-Context Learning, Language Modeling | Code Available | 7 |
| Better speech synthesis through scaling | May 12, 2023 | Image Generation, Speech Synthesis | Code Available | 6 |
| ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech | Nov 7, 2022 | Representation Learning, Speech Representation Learning | Code Available | 6 |
| PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit | May 20, 2022 | All, Automatic Speech Recognition (ASR) | Code Available | 6 |
| Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation | Sep 25, 2024 | text-to-speech, Text to Speech | Code Available | 5 |
| SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation | Jan 24, 2024 | text-to-speech, Text to Speech | Code Available | 5 |
| StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models | Jun 13, 2023 | Speech Synthesis, text-to-speech | Code Available | 5 |
| XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech | May 31, 2023 | text-to-speech, Text to Speech | Code Available | 5 |
| Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | Mar 7, 2023 | In-Context Learning, Language Modeling | Code Available | 5 |
| Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions | Jan 20, 2023 | text-to-speech, Text to Speech | Code Available | 5 |
| ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching | Jul 12, 2025 | Dialogue Generation, text-to-speech | Code Available | 4 |
| ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching | Jun 16, 2025 | Decoder, Speech Synthesis | Code Available | 4 |
| Ming-Omni: A Unified Multimodal Model for Perception and Generation | Jun 11, 2025 | Image Generation, text-to-speech | Code Available | 4 |
| VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | May 6, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 4 |
| AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining | Aug 10, 2023 | Audio Generation, In-Context Learning | Code Available | 4 |
| Enhancing Suno's Bark Text-to-Speech Model: Addressing Limitations Through Meta's Encodec and Pre-Trained Hubert | Apr 18, 2023 | Audio Generation, Expressive Speech Synthesis | Code Available | 4 |
| EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge | May 29, 2025 | text-to-speech, Text to Speech | Code Available | 3 |
| Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play | May 5, 2025 | AI Agent, Automatic Speech Recognition | Code Available | 3 |
| MoonCast: High-Quality Zero-Shot Podcast Generation | Mar 18, 2025 | Speech Synthesis, text-to-speech | Code Available | 3 |
| Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey | Dec 9, 2024 | Speech Synthesis, Survey | Code Available | 3 |
| WavChat: A Survey of Spoken Dialogue Models | Nov 15, 2024 | speech-recognition, Speech Recognition | Code Available | 3 |
| Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model | Aug 30, 2024 | Audio Compression, Audio Generation | Code Available | 3 |
| PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation | Aug 14, 2024 | Speech Synthesis, text-to-speech | Code Available | 3 |
| ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control | Jun 3, 2024 | Speech Synthesis, text-to-speech | Code Available | 3 |
| NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models | Mar 5, 2024 | Quantization, Speech Synthesis | Code Available | 3 |
| HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis | Nov 21, 2023 | Speech Synthesis, Super-Resolution | Code Available | 3 |
| ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech | Jul 13, 2022 | Denoising, GPU | Code Available | 3 |
| SoundStream: An End-to-End Neural Audio Codec | Jul 7, 2021 | CPU, Decoder | Code Available | 3 |
| UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation | Jun 15, 2021 | Speech Synthesis, text-to-speech | Code Available | 3 |
| Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning | Jul 9, 2019 | Speech Synthesis, text-to-speech | Code Available | 3 |
| Differentiable Reward Optimization for LLM based TTS system | Jul 8, 2025 | text-to-speech, Text to Speech | Code Available | 2 |
| PresentAgent: Multimodal Agent for Presentation Video Generation | Jul 5, 2025 | text-to-speech, Text to Speech | Code Available | 2 |
| RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching | Jun 20, 2025 | Scheduling, Speech Synthesis | Code Available | 2 |
| Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment | May 26, 2025 | text-to-speech, Text to Speech | Code Available | 2 |
| RWKVTTS: Yet another TTS based on RWKV-7 | Apr 4, 2025 | Computational Efficiency, text-to-speech | Code Available | 2 |
| TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection | Mar 31, 2025 | Fraud Detection, Large Language Model | Code Available | 2 |
| Scaling Rich Style-Prompted Text-to-Speech Datasets | Mar 6, 2025 | Language Modeling, Language Modelling | Code Available | 2 |