| Does Audio Deepfake Detection Generalize? | Mar 30, 2022 | Audio Deepfake Detection, DeepFake Detection | —Unverified | 0 | 0 |
| Do Prosody Transfer Models Transfer Prosody? | Mar 7, 2023 | Speech Synthesis, text-to-speech | —Unverified | 0 | 0 |
| DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech | Sep 18, 2024 | text-to-speech, Text to Speech | —Unverified | 0 | 0 |
| DPP-TTS: Diversifying prosodic features of speech via determinantal point processes | Oct 23, 2023 | Diversity, Point Processes | —Unverified | 0 | 0 |
| DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech | Jun 25, 2023 | Speech Synthesis, text-to-speech | —Unverified | 0 | 0 |
| DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction | Mar 1, 2023 | Dynamic Time Warping, Metric Learning | —Unverified | 0 | 0 |
| Dual Audio-Centric Modality Coupling for Talking Head Generation | Mar 26, 2025 | NeRF, Talking Head Generation | —Unverified | 0 | 0 |
| Dual Script E2E framework for Multilingual and Code-Switching ASR | Jun 2, 2021 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 | 0 |
| DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance | Aug 26, 2024 | Diversity, text-to-speech | —Unverified | 0 | 0 |
| Dual Supervised Learning | Jul 3, 2017 | General Classification, image-classification | —Unverified | 0 | 0 |
| DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing | Jun 13, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech | Feb 27, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis | Oct 17, 2024 | Speech Synthesis, text-to-speech | —Unverified | 0 | 0 |
| DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis | Sep 22, 2023 | Denoising, Speech Synthesis | —Unverified | 0 | 0 |
| Dynamic Prosody Generation for Speech Synthesis using Linguistics-Driven Acoustic Embedding Selection | Dec 2, 2019 | Speech Synthesis, text-to-speech | —Unverified | 0 | 0 |
| E1 TTS: Simple and Fast Non-Autoregressive TTS | Sep 14, 2024 | Denoising, text-to-speech | —Unverified | 0 | 0 |
| E3 TTS: Easy End-to-End Diffusion-based Text to Speech | Nov 2, 2023 | text-to-speech, Text to Speech | —Unverified | 0 | 0 |
| Easy, Interpretable, Effective: openSMILE for voice deepfake detection | Aug 28, 2024 | DeepFake Detection, Face Swapping | —Unverified | 0 | 0 |
| Effective Decoder Masking for Transformer Based End-to-End Speech Recognition | Oct 27, 2020 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 | 0 |
| Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition | Mar 31, 2022 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 | 0 |
| Effect of choice of probability distribution, randomness, and search methods for alignment modeling in sequence-to-sequence text-to-speech synthesis using hard alignment | Oct 28, 2019 | Hard Attention, Speech Synthesis | —Unverified | 0 | 0 |
| Efficient data selection employing Semantic Similarity-based Graph Structures for model training | Feb 22, 2024 | Semantic Similarity, Semantic Textual Similarity | —Unverified | 0 | 0 |
| Efficient Generative Modeling with Residual Vector Quantization-Based Tokens | Dec 13, 2024 | Conditional Image Generation, Image Generation | —Unverified | 0 | 0 |
| Efficient Incremental Text-to-Speech on GPUs | Nov 25, 2022 | GPU, Speech Synthesis | —Unverified | 0 | 0 |
| Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS | Oct 24, 2022 | Data Augmentation, GPU | —Unverified | 0 | 0 |
| Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch | Oct 9, 2024 | Speech Synthesis, text-to-speech | —Unverified | 0 | 0 |
| ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams | Oct 23, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 | 0 |
| ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering | Jan 14, 2024 | Audio Generation, Language Modeling | —Unverified | 0 | 0 |
| EmoCat: Language-agnostic Emotional Voice Conversion | Jan 14, 2021 | Decoder, text-to-speech | —Unverified | 0 | 0 |
| EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance | Nov 17, 2022 | Denoising, text-to-speech | —Unverified | 0 | 0 |
| Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization | Sep 16, 2024 | Emotional Speech Synthesis, In-Context Learning | —Unverified | 0 | 0 |
| EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations | Dec 9, 2024 | text-to-speech, Text to Speech | —Unverified | 0 | 0 |
| EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis | Feb 2, 2025 | Self-Supervised Learning, SSIM | —Unverified | 0 | 0 |
| Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions | Sep 25, 2024 | Attribute, Dimensionality Reduction | —Unverified | 0 | 0 |
| Emotional Prosody Control for Speech Generation | Nov 7, 2021 | text-to-speech, Text to Speech | —Unverified | 0 | 0 |
| Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition | Oct 26, 2020 | Emotion Recognition, Speech Emotion Recognition | —Unverified | 0 | 0 |
| EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model | Jun 17, 2021 | Emotional Speech Synthesis, Emotion Classification | —Unverified | 0 | 0 |
| EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting | Apr 17, 2025 | text-to-speech, Text to Speech | —Unverified | 0 | 0 |
| Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems | Jan 16, 2022 | text-to-speech, Text to Speech | —Unverified | 0 | 0 |
| Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems | Jul 1, 2022 | text-to-speech, Text to Speech | —Unverified | 0 | 0 |
| Emphasis control for parallel neural TTS | Oct 6, 2021 | Sentence, text-to-speech | —Unverified | 0 | 0 |
| Emphasized Accent Phrase Prediction from Text for Advertisement Text-To-Speech Synthesis | Dec 1, 2014 | Speech Synthesis, text-to-speech | —Unverified | 0 | 0 |
| Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition | Feb 20, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 | 0 |
| Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis | Apr 10, 2025 | Speech Synthesis, text-to-speech | —Unverified | 0 | 0 |
| EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech | Mar 13, 2024 | GPU, Speech Synthesis | —Unverified | 0 | 0 |
| End-to-End Feedback Loss in Speech Chain Framework via Straight-Through Estimator | Oct 31, 2018 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 | 0 |
| End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2 | Jan 11, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 | 0 |
| End-to-end speech recognition modeling from de-identified data | Jul 12, 2022 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 | 0 |
| End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue | Jun 24, 2022 | text-to-speech, Text to Speech | —Unverified | 0 | 0 |
| End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning | Apr 13, 2019 | Cross-Lingual Transfer, text-to-speech | —Unverified | 0 | 0 |