| Large-Scale Automatic Audiobook Creation | Sep 7, 2023 | text-to-speech, Text to Speech | Unverified | 0 |
| GRASS: Unified Generation Model for Speech-to-Semantic Tasks | Sep 6, 2023 | named-entity-recognition, Named Entity Recognition | Unverified | 0 |
| MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023 | Sep 6, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| PromptTTS 2: Describing and Generating Voices with Text Prompt | Sep 5, 2023 | Language Modelling, Large Language Model | Unverified | 0 |
| A Comparative Analysis of Pretrained Language Models for Text-to-Speech | Sep 4, 2023 | Natural Language Understanding, Prediction | Unverified | 0 |
| The FruitShell French synthesis system at the Blizzard 2023 Challenge | Sep 1, 2023 | Data Augmentation, Speech Synthesis | Unverified | 0 |
| Learning Speech Representation From Contrastive Token-Acoustic Pretraining | Sep 1, 2023 | Audio Classification, Automatic Speech Recognition | Unverified | 0 |
| Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information | Aug 31, 2023 | Decoder, Multi-Task Learning | Unverified | 0 |
| Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis | Aug 31, 2023 | Expressive Speech Synthesis, Sentence | Unverified | 0 |
| The DeepZen Speech Synthesis System for Blizzard Challenge 2023 | Aug 30, 2023 | Sentence, Speech Synthesis | Unverified | 0 |
| Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech | Aug 28, 2023 | Domain Generalization, text-to-speech | Unverified | 0 |
| Rep2wav: Noise Robust text-to-speech Using self-supervised representations | Aug 28, 2023 | Speech Enhancement, text-to-speech | Unverified | 0 |
| Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations | Aug 24, 2023 | Representation Learning, Speech Synthesis | Unverified | 0 |
| Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models | Aug 21, 2023 | text-to-speech, Text to Speech | Unverified | 0 |
| AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis | Aug 16, 2023 | Attribute, Speech Synthesis | Unverified | 0 |
| SpeechX: Neural Codec Language Model as a Versatile Speech Transformer | Aug 14, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation | Aug 12, 2023 | Talking Head Generation, text-to-speech | Code Available | 0 |
| Let's Give a Voice to Conversational Agents in Virtual Reality | Aug 4, 2023 | Speech-to-Text, text-to-speech | Code Available | 0 |
| SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis | Aug 2, 2023 | Decoder, Self-Supervised Learning | Unverified | 0 |
| Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech | Jul 31, 2023 | Acoustic Modelling, Speech Synthesis | Unverified | 0 |
| Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings | Jul 31, 2023 | Grapheme-to-Phoneme Conversion, speech-recognition | Unverified | 0 |
| Multilingual context-based pronunciation learning for Text-to-Speech | Jul 31, 2023 | text-to-speech, Text to Speech | Unverified | 0 |
| METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer | Jul 29, 2023 | Disentanglement, Diversity | Unverified | 0 |
| Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding | Jul 28, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs | Jul 18, 2023 | Generative Adversarial Network, Language Modeling | Unverified | 0 |
| Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis | Jul 14, 2023 | In-Context Learning, Language Modelling | Unverified | 0 |
| Controllable Emphasis with zero data for text-to-speech | Jul 13, 2023 | Sentence, text-to-speech | Unverified | 0 |
| On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis | Jul 11, 2023 | Prediction, Self-Supervised Learning | Unverified | 0 |
| Artificial Eye for the Blind | Jul 7, 2023 | Object, object-detection | Unverified | 0 |
| ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading | Jul 3, 2023 | Form, Sentence | Unverified | 0 |
| High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units | Jun 29, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech | Jun 27, 2023 | Disentanglement, Style Generalization | Unverified | 0 |
| DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech | Jun 25, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | Jun 23, 2023 | In-Context Learning, Speech Synthesis | Code Available | 0 |
| Visual-Aware Text-to-Speech | Jun 21, 2023 | Rhythm, Speech Synthesis | Unverified | 0 |
| Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer | Jun 20, 2023 | text-to-speech, Text to Speech | Unverified | 0 |
| Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation | Jun 16, 2023 | Data Augmentation, text-to-speech | Unverified | 0 |
| CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages | Jun 16, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation | Jun 14, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling | Jun 13, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech | Jun 9, 2023 | Emotion Recognition, Speech Emotion Recognition | Unverified | 0 |
| VIFS: An End-to-End Variational Inference for Foley Sound Synthesis | Jun 8, 2023 | Speech Synthesis, text-to-speech | Code Available | 0 |
| Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias | Jun 6, 2023 | Attribute, Inductive Bias | Unverified | 0 |
| Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis | Jun 6, 2023 | Neural Rendering, text-to-speech | Unverified | 0 |
| Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model | Jun 5, 2023 | Cross-Lingual Transfer, Language Modeling | Unverified | 0 |
| Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming | Jun 5, 2023 | Bayesian Inference, Singing Voice Synthesis | Code Available | 0 |
| Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis | Jun 5, 2023 | Rhythm, Sentence | Unverified | 0 |
| Towards Robust FastSpeech 2 by Modelling Residual Multimodality | Jun 2, 2023 | Decoder, text-to-speech | Unverified | 0 |
| The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech | Jun 1, 2023 | Cross-Lingual Transfer, text-to-speech | Unverified | 0 |
| Text-to-Speech Pipeline for Swiss German -- A comparison | May 31, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |