| Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE | Jun 6, 2022 | Representation Learning, Speech Representation Learning | —Unverified | 0 |
| UzbekTagger: The rule-based POS tagger for Uzbek language | Jan 30, 2023 | Language Modeling, Language Modelling | —Unverified | 0 |
| VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages | May 21, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 |
| VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers | Jun 8, 2024 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment | Jun 12, 2024 | Quantization, Speech Synthesis | —Unverified | 0 |
| VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech | Jan 25, 2024 | Decoder, Hallucination | —Unverified | 0 |
| VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention | Feb 12, 2021 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| 可變速中文文字轉語音系統 (Variable Speech Rate Mandarin Chinese Text-to-Speech System) [In Chinese] | Mar 1, 2012 | text-to-speech, Text to Speech | —Unverified | 0 |
| Varianceflow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow | Feb 27, 2023 | text-to-speech, Text to Speech | —Unverified | 0 |
| VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech | Jun 12, 2024 | text-to-speech, Text to Speech | —Unverified | 0 |
| Vers une annotation automatique de corpus audio pour la synthèse de parole (Towards Fully Automatic Annotation of Audio Books for Text-To-Speech (TTS) Synthesis) [in French] | Jun 1, 2012 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement | Feb 11, 2025 | Disentanglement, text-to-speech | —Unverified | 0 |
| ViDA-MAN: Visual Dialog with Digital Humans | Oct 26, 2021 | speech-recognition, Speech Recognition | —Unverified | 0 |
| Vietnamese Text-To-Speech Shared Task VLSP 2020: Remaining problems with state-of-the-art techniques | Dec 1, 2020 | text-to-speech, Text to Speech | —Unverified | 0 |
| VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation | May 25, 2023 | Decoder, Language Modeling | —Unverified | 0 |
| Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech | Oct 27, 2022 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 |
| Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis | Nov 26, 2024 | Decoder, multimodal generation | —Unverified | 0 |
| Visual-Aware Text-to-Speech | Jun 21, 2023 | Rhythm, Speech Synthesis | —Unverified | 0 |
| VisualSpeech: Enhance Prosody with Visual Context in TTS | Jan 31, 2025 | Prosody Prediction, text-to-speech | —Unverified | 0 |
| VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over | Oct 7, 2021 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer | May 22, 2023 | Decoder, Denoising | —Unverified | 0 |
| Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise | Mar 20, 2022 | text-to-speech, Text to Speech | —Unverified | 0 |
| VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection | Mar 10, 2025 | NVIDIA Jetson Orin Nano, object-detection | —Unverified | 0 |
| Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network | Apr 11, 2024 | Autonomous Vehicles, text-to-speech | —Unverified | 0 |
| Voice Builder: A Tool for Building Text-To-Speech Voices | May 1, 2018 | text-to-speech, Text to Speech | —Unverified | 0 |
| Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning | Feb 10, 2021 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer | Sep 3, 2020 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 |
| Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module | Feb 16, 2022 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| Voice Imitating Text-to-Speech Neural Networks | Jun 4, 2018 | Sentence, text-to-speech | —Unverified | 0 |
| VoiceLDM: Text-to-Speech with Environmental Context | Sep 24, 2023 | AudioCaps, text-to-speech | —Unverified | 0 |
| VoiceWukong: Benchmarking Deepfake Voice Detection | Sep 10, 2024 | Benchmarking, Face Swapping | —Unverified | 0 |
| Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech | May 21, 2025 | text-to-speech, Text to Speech | —Unverified | 0 |
| VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka | Sep 3, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 |
| VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing | Aug 11, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 |
| VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature | Apr 2, 2022 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications | May 19, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes | Nov 29, 2023 | Face Recognition, Face Swapping | —Unverified | 0 |
| Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder | Jul 31, 2018 | Generative Adversarial Network, Speech Synthesis | —Unverified | 0 |
| Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks | Oct 30, 2018 | Image Generation, Speech Synthesis | —Unverified | 0 |
| WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss | Feb 2, 2020 | text-to-speech, Text to Speech | —Unverified | 0 |
| Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis | Mar 24, 2023 | Generative Adversarial Network, Speech Synthesis | —Unverified | 0 |
| WavThruVec: Latent speech representation as intermediate features for neural speech synthesis | Mar 31, 2022 | Speech Synthesis, text-to-speech | —Unverified | 0 |
| WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing | Jun 2, 2025 | Keyword Spotting, speech-recognition | —Unverified | 0 |
| Weakly-supervised text-to-speech alignment confidence measure | Dec 1, 2016 | speech-recognition, Speech Recognition | —Unverified | 0 |
| Werewolf: A Straightforward Game Framework with TTS for Improved User Engagement | May 30, 2025 | text-to-speech, Text to Speech | —Unverified | 0 |
| What happens to diffusion model likelihood when your model is conditional? | Sep 10, 2024 | domain classification, model | —Unverified | 0 |
| What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS | Sep 4, 2020 | Decoder, Sentence | —Unverified | 0 |
| What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection | May 23, 2025 | Face Swapping, Sensitivity | —Unverified | 0 |
| Whispered and Lombard Neural Speech Synthesis | Jan 13, 2021 | Speaker Verification, Speech Synthesis | —Unverified | 0 |
| Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective | Dec 22, 2024 | text-to-speech, Text to Speech | —Unverified | 0 |