| GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM | Apr 15, 2025 | Quantization, Reading Comprehension | Unverified | 0 |
| Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis | Apr 14, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis | Apr 14, 2025 | RAG, Retrieval-augmented Generation | Unverified | 0 |
| Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation | Apr 11, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow | Apr 10, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis | Apr 10, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation | Apr 7, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| Speculative End-Turn Detector for Efficient Speech Chatbot Assistant | Mar 30, 2025 | Chatbot, Collaborative Inference | Unverified | 0 |
| SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System | Mar 29, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation | Mar 28, 2025 | Audio Generation, Audio-Visual Synchronization | Unverified | 0 |
| Dual Audio-Centric Modality Coupling for Talking Head Generation | Mar 26, 2025 | NeRF, Talking Head Generation | Unverified | 0 |
| Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication | Mar 21, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation | Mar 14, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR | Mar 11, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection | Mar 10, 2025 | NVIDIA Jetson Orin Nano, object-detection | Unverified | 0 |
| InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training | Mar 4, 2025 | Instruction Following, text-to-speech | Unverified | 0 |
| Direct Speech to Speech Translation: A Review | Mar 3, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation | Mar 2, 2025 | Decoder, Representation Learning | Unverified | 0 |
| Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale | Feb 27, 2025 | AI Agent, Large Language Model | Unverified | 0 |
| Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding | Feb 26, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis | Feb 26, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision | Feb 26, 2025 | Audio Synthesis, Automatic Speech Recognition | Unverified | 0 |
| Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM | Feb 24, 2025 | Automatic Speech Recognition, Language Modeling | Unverified | 0 |
| NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing | Feb 17, 2025 | Lip to Speech Synthesis, speech-recognition | Unverified | 0 |
| SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer | Feb 16, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech | Feb 13, 2025 | Adversarial Attack, Adversarial Attack Detection | Unverified | 0 |
| Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement | Feb 11, 2025 | Disentanglement, text-to-speech | Unverified | 0 |
| LoRP-TTS: Low-Rank Personalized Text-To-Speech | Feb 11, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Synthetic Audio Helps for Cognitive State Tasks | Feb 10, 2025 | text-to-speech, Text to Speech | Code Available | 0 |
| Speech to Speech Translation with Translatotron: A State of the Art Review | Feb 9, 2025 | speech-recognition, Speech Recognition | Unverified | 0 |
| Gender Bias in Instruction-Guided Speech Synthesis Models | Feb 8, 2025 | Expressive Speech Synthesis, Speech Synthesis | Unverified | 0 |
| Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech | Feb 5, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation | Feb 4, 2025 | Change Detection, Gender Classification | Unverified | 0 |
| EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis | Feb 2, 2025 | Self-Supervised Learning, SSIM | Unverified | 0 |
| VisualSpeech: Enhance Prosody with Visual Context in TTS | Jan 31, 2025 | Prosody Prediction, text-to-speech | Unverified | 0 |
| BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights | Jan 29, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Compact Neural TTS Voices for Accessibility | Jan 28, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models | Jan 24, 2025 | Emotion Classification, Speaker Identification | Unverified | 0 |
| Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation | Jan 24, 2025 | Audio Deepfake Detection, DeepFake Detection | Unverified | 0 |
| LoCoML: A Framework for Real-World ML Inference Pipelines | Jan 24, 2025 | Automatic Speech Recognition, Machine Translation | Unverified | 0 |
| Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement | Jan 23, 2025 | Data Augmentation, Speech Enhancement | Unverified | 0 |
| Development of an Inclusive Educational Platform Using Open Technologies and Machine Learning: A Case Study on Accessibility Enhancement | Jan 22, 2025 | Object Recognition, speech-recognition | Unverified | 0 |
| A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data | Jan 21, 2025 | Domain Adaptation, speech-recognition | Unverified | 0 |
| Speech Synthesis along Perceptual Voice Quality Dimensions | Jan 15, 2025 | Expressive Speech Synthesis, Speech Synthesis | Unverified | 0 |
| Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement | Jan 15, 2025 | Computational Efficiency, CPU | Unverified | 0 |
| AI-Powered Assistive Technologies for Visual Impairment | Jan 14, 2025 | Object Recognition, text-to-speech | Unverified | 0 |
| MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model | Jan 10, 2025 | Decoder, Language Modelling | Unverified | 0 |
| PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control | Jan 10, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer | Jan 10, 2025 | speech-recognition, Speech Recognition | Unverified | 0 |
| Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron | Jan 10, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |