| TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument | Feb 13, 2025 | Audio Generation, Decoder | Code Available | 2 |
| RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer | Jan 2, 2025 | Audio Generation, text-to-speech | Code Available | 2 |
| EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector | Nov 4, 2024 | Decoder, Emotional Speech Synthesis | Code Available | 2 |
| Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis | Oct 30, 2024 | Speech Synthesis, text-to-speech | Code Available | 2 |
| Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier | Oct 28, 2024 | Audio Deepfake Detection, Audio Generation | Code Available | 2 |
| Recent Advances in Speech Language Models: A Survey | Oct 1, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 2 |
| EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control | Oct 1, 2024 | Emotional Speech Synthesis, Speech Synthesis | Code Available | 2 |
| SafeEar: Content Privacy-Preserving Audio Deepfake Detection | Sep 14, 2024 | Audio Deepfake Detection, DeepFake Detection | Code Available | 2 |
| SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis | Sep 11, 2024 | Decoder, Speech Synthesis | Code Available | 2 |
| IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS | Sep 9, 2024 | Denoising, Speech Enhancement | Code Available | 2 |
| Sample-Efficient Diffusion for Text-To-Speech Synthesis | Sep 1, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| TTSDS -- Text-to-Speech Distribution Score | Jul 17, 2024 | text-to-speech, Text to Speech | Code Available | 2 |
| CATT: Character-based Arabic Tashkeel Transformer | Jul 3, 2024 | Arabic Text Diacritization, Decoder | Code Available | 2 |
| DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability | Jun 27, 2024 | Speech Synthesis, text-to-speech | Code Available | 2 |
| DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors | Jun 17, 2024 | text-to-speech, Text to Speech | Code Available | 2 |
| LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning | Jun 12, 2024 | text-to-speech, Text to Speech | Code Available | 2 |
| EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech | Jun 12, 2024 | Emotional Speech Synthesis, text-to-speech | Code Available | 2 |
| WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark | Jun 9, 2024 | text-to-speech, Text to Speech | Code Available | 2 |
| Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis | Jun 6, 2024 | Decoder, Inductive Bias | Code Available | 2 |
| TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation | May 28, 2024 | Machine Translation, speech-recognition | Code Available | 2 |
| CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations | Apr 10, 2024 | Dialogue Generation, text-to-speech | Code Available | 2 |
| Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness | Apr 10, 2024 | Speech Synthesis, text-to-speech | Code Available | 2 |
| CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models | Mar 31, 2024 | Denoising, Speech Synthesis | Code Available | 2 |
| An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation | Feb 26, 2024 | Dataset Generation, text-to-speech | Code Available | 2 |
| Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation | Feb 8, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 2 |
| PAM: Prompting Audio-Language Models for Audio Quality Assessment | Feb 1, 2024 | Audio Quality Assessment, Music Generation | Code Available | 2 |
| DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations without Text Alignment | Jan 16, 2024 | Disentanglement, Self-Supervised Learning | Code Available | 2 |
| Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling | Oct 14, 2023 | Speech Synthesis, text-to-speech | Code Available | 2 |
| LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | Oct 7, 2023 | Audio captioning, Automatic Speech Recognition | Code Available | 2 |
| FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec | Sep 14, 2023 | Automatic Speech Recognition, speech-recognition | Code Available | 2 |
| VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching | Sep 10, 2023 | text-to-speech, Text to Speech | Code Available | 2 |
| SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models | Aug 31, 2023 | Decoder, Language Modeling | Code Available | 2 |
| SeamlessM4T: Massively Multilingual & Multimodal Machine Translation | Aug 22, 2023 | Automatic Speech Recognition, Machine Translation | Code Available | 2 |
| VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design | Jul 31, 2023 | Computational Efficiency, text-to-speech | Code Available | 2 |
| CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model | May 11, 2023 | Denoising, GPU | Code Available | 2 |
| Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis | Apr 26, 2023 | Speech Synthesis, text-to-speech | Code Available | 2 |
| NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | Apr 18, 2023 | In-Context Learning, Speech Synthesis | Code Available | 2 |
| PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS | Feb 24, 2023 | Decoder, text-to-speech | Code Available | 2 |
| A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech | Feb 8, 2023 | Code Generation, Diversity | Code Available | 2 |
| Towards Building Text-To-Speech Systems for the Next Billion Users | Nov 17, 2022 | Diversity, Speech Synthesis | Code Available | 2 |
| Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform | Oct 28, 2022 | CPU, Knowledge Distillation | Code Available | 2 |
| DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech | Jul 3, 2022 | text-to-speech, Text to Speech | Code Available | 2 |
| StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis | May 30, 2022 | Data Augmentation, Self-Supervised Learning | Code Available | 2 |
| GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech | May 15, 2022 | Speech Synthesis, Style Transfer | Code Available | 2 |
| NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality | May 9, 2022 | Sentence, Speech Synthesis | Code Available | 2 |
| FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis | Apr 21, 2022 | Denoising, GPU | Code Available | 2 |
| Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation | Mar 29, 2022 | CPU, Decoder | Code Available | 2 |
| iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform | Mar 4, 2022 | Speech Synthesis, text-to-speech | Code Available | 2 |
| Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows | Mar 3, 2022 | Speech Synthesis, text-to-speech | Code Available | 2 |
| PortaSpeech: Portable and High-Quality Generative Text-to-Speech | Sep 30, 2021 | text-to-speech, Text to Speech | Code Available | 2 |