| Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling | Oct 14, 2023 | Speech Synthesis, text-to-speech | Code Available | 2 |
| Crowdsourced and Automatic Speech Prominence Estimation | Oct 12, 2023 | Emotion Recognition, text-to-speech | Code Available | 1 |
| On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition | Oct 12, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Prosody Analysis of Audiobooks | Oct 10, 2023 | Attribute, Language Modeling | Code Available | 0 |
| Neutral TTS Female Voice Corpus in Brazilian Portuguese | Oct 8, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Unified speech and gesture synthesis using flow matching | Oct 8, 2023 | Audio Synthesis, Motion Synthesis | Unverified | 0 |
| Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset | Oct 8, 2023 | text-to-speech, Text to Speech | Unverified | 0 |
| LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | Oct 7, 2023 | Audio captioning, Automatic Speech Recognition | Code Available | 2 |
| Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis | Oct 5, 2023 | Data Augmentation, Speech Synthesis | Unverified | 0 |
| The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains | Oct 4, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Towards human-like spoken dialogue generation between AI agents from written dialogue | Oct 2, 2023 | Dialogue Generation, text-to-speech | Unverified | 0 |
| Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech | Oct 1, 2023 | speech-recognition, Speech Recognition | Code Available | 1 |
| Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features | Sep 29, 2023 | Synthetic Speech Detection, text-to-speech | Unverified | 0 |
| Low-Resource Self-Supervised Learning with SSL-Enhanced TTS | Sep 29, 2023 | Self-Supervised Learning, text-to-speech | Unverified | 0 |
| High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models | Sep 27, 2023 | All, Speech Synthesis | Unverified | 0 |
| Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping | Sep 25, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| BiSinger: Bilingual Singing Voice Synthesis | Sep 25, 2023 | Singing Voice Synthesis, text-to-speech | Code Available | 1 |
| VoiceLDM: Text-to-Speech with Environmental Context | Sep 24, 2023 | AudioCaps, text-to-speech | Unverified | 0 |
| DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis | Sep 22, 2023 | Denoising, Speech Synthesis | Unverified | 0 |
| Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech | Sep 21, 2023 | text-to-speech, Text to Speech | Code Available | 1 |
| The Impact of Silence on Speech Anti-Spoofing | Sep 21, 2023 | Action Detection, Activity Detection | Unverified | 0 |
| Speak While You Think: Streaming Speech Synthesis During Text Generation | Sep 20, 2023 | Speech Synthesis, Text Generation | Unverified | 0 |
| Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model | Sep 20, 2023 | Chatbot, Language Modeling | Code Available | 1 |
| Exploring Speech Enhancement for Low-resource Speech Synthesis | Sep 19, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition | Sep 19, 2023 | Data Augmentation, Emotion Recognition | Unverified | 0 |
| Augmenting text for spoken language understanding with Large Language Models | Sep 17, 2023 | Semantic Parsing, Spoken Language Understanding | Unverified | 0 |
| HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods | Sep 15, 2023 | Audio Deepfake Detection, DeepFake Detection | Code Available | 1 |
| PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions | Sep 15, 2023 | text-to-speech, Text to Speech | Unverified | 0 |
| Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech | Sep 15, 2023 | Knowledge Distillation, Speech Synthesis | Unverified | 0 |
| FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec | Sep 14, 2023 | Automatic Speech Recognition, speech-recognition | Code Available | 2 |
| Direct Text to Speech Translation System using Acoustic Units | Sep 14, 2023 | Decoder, Speech-to-Speech Translation | Unverified | 0 |
| Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP | Sep 11, 2023 | text-to-speech, Text to Speech | Code Available | 1 |
| VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching | Sep 10, 2023 | text-to-speech, Text to Speech | Code Available | 2 |
| Cross-Utterance Conditioned VAE for Speech Generation | Sep 8, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Large-Scale Automatic Audiobook Creation | Sep 7, 2023 | text-to-speech, Text to Speech | Unverified | 0 |
| MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023 | Sep 6, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| GRASS: Unified Generation Model for Speech-to-Semantic Tasks | Sep 6, 2023 | named-entity-recognition, Named Entity Recognition | Unverified | 0 |
| PromptTTS 2: Describing and Generating Voices with Text Prompt | Sep 5, 2023 | Language Modelling, Large Language Model | Unverified | 0 |
| A Comparative Analysis of Pretrained Language Models for Text-to-Speech | Sep 4, 2023 | Natural Language Understanding, Prediction | Unverified | 0 |
| The FruitShell French synthesis system at the Blizzard 2023 Challenge | Sep 1, 2023 | Data Augmentation, Speech Synthesis | Unverified | 0 |
| Learning Speech Representation From Contrastive Token-Acoustic Pretraining | Sep 1, 2023 | Audio Classification, Automatic Speech Recognition | Unverified | 0 |
| QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning | Aug 31, 2023 | Representation Learning, Speech Representation Learning | Code Available | 1 |
| SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models | Aug 31, 2023 | Decoder, Language Modeling | Code Available | 2 |
| Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis | Aug 31, 2023 | Expressive Speech Synthesis, Sentence | Unverified | 0 |
| Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information | Aug 31, 2023 | Decoder, Multi-Task Learning | Unverified | 0 |
| The DeepZen Speech Synthesis System for Blizzard Challenge 2023 | Aug 30, 2023 | Sentence, Speech Synthesis | Unverified | 0 |
| Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech | Aug 28, 2023 | Domain Generalization, text-to-speech | Unverified | 0 |
| TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models | Aug 28, 2023 | Language Modelling, text-to-speech | Code Available | 1 |
| Rep2wav: Noise Robust text-to-speech Using self-supervised representations | Aug 28, 2023 | Speech Enhancement, text-to-speech | Unverified | 0 |
| Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations | Aug 24, 2023 | Representation Learning, Speech Synthesis | Unverified | 0 |