| Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement | Jan 15, 2025 | Computational Efficiency, CPU | Unverified | 0 |
| AI-Powered Assistive Technologies for Visual Impairment | Jan 14, 2025 | Object Recognition, text-to-speech | Unverified | 0 |
| MathReader : Text-to-Speech for Mathematical Documents | Jan 13, 2025 | Optical Character Recognition (OCR), text-to-speech | Code Available | 1 |
| PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control | Jan 10, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer | Jan 10, 2025 | speech-recognition, Speech Recognition | Unverified | 0 |
| MinMo: A Multimodal Large Language Model for Seamless Voice Interaction | Jan 10, 2025 | Instruction Following, Language Modeling | Unverified | 0 |
| Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron | Jan 10, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model | Jan 10, 2025 | Decoder, Language Modelling | Unverified | 0 |
| Probing Speaker-specific Features in Speaker Representations | Jan 9, 2025 | Self-Supervised Learning, Speaker Verification | Unverified | 0 |
| Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model | Jan 8, 2025 | text-to-speech, Text to Speech | Unverified | 0 |
| FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles | Jan 2, 2025 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT | Jan 2, 2025 | Polyphone disambiguation, Sentence | Unverified | 0 |
| RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer | Jan 2, 2025 | Audio Generation, text-to-speech | Code Available | 2 |
| Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting | Dec 28, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities | Dec 26, 2024 | Domain Adaptation, Language Modeling | Code Available | 0 |
| Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID | Dec 26, 2024 | Language Identification, text-to-speech | Unverified | 0 |
| Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset | Dec 25, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis | Dec 22, 2024 | Decoder, Disentanglement | Unverified | 0 |
| Autoregressive Speech Synthesis with Next-Distribution Prediction | Dec 22, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective | Dec 22, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers | Dec 20, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling | Dec 19, 2024 | Attribute, Speech Enhancement | Unverified | 0 |
| Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion | Dec 17, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes | Dec 17, 2024 | DeepFake Detection, Face Swapping | Unverified | 0 |
| ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis | Dec 16, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech | Dec 16, 2024 | text-to-speech, Text to Speech | Code Available | 0 |
| Efficient Generative Modeling with Residual Vector Quantization-Based Tokens | Dec 13, 2024 | Conditional Image Generation, Image Generation | Unverified | 0 |
| AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation | Dec 13, 2024 | Data Augmentation, Sarcasm Detection | Unverified | 0 |
| CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder | Dec 12, 2024 | Audio Synthesis, Singing Voice Synthesis | Unverified | 0 |
| A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings | Dec 11, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction | Dec 11, 2024 | Decoder, Self-Supervised Learning | Unverified | 0 |
| LatentSpeech: Latent Diffusion for Text-To-Speech Generation | Dec 11, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration | Dec 11, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Multimodal Latent Language Modeling with Next-Token Diffusion | Dec 11, 2024 | Image Generation, Language Modeling | Code Available | 0 |
| Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey | Dec 9, 2024 | Speech Synthesis, Survey | Code Available | 3 |
| EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations | Dec 9, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles | Dec 4, 2024 | Prosody Prediction, text-to-speech | Unverified | 0 |
| GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | Dec 3, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 7 |
| Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor | Dec 1, 2024 | All, Natural Language Understanding | Unverified | 0 |
| SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation | Nov 27, 2024 | Question Answering, Speech Enhancement | Unverified | 0 |
| Continual Learning in Machine Speech Chain Using Gradient Episodic Memory | Nov 27, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis | Nov 26, 2024 | Decoder, multimodal generation | Unverified | 0 |
| Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM | Nov 20, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| A Context-Based Numerical Format Prediction for a Text-To-Speech System | Nov 19, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D | Nov 19, 2024 | Speech-to-Text, text-to-speech | Unverified | 0 |
| Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation | Nov 19, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| WavChat: A Survey of Spoken Dialogue Models | Nov 15, 2024 | speech-recognition, Speech Recognition | Code Available | 3 |
| Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models | Nov 12, 2024 | Grapheme-to-Phoneme Conversion, Retrieval | Unverified | 0 |
| Debatts: Zero-Shot Debating Text-to-Speech Synthesis | Nov 10, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR | Nov 7, 2024 | Language Modelling, Large Language Model | Unverified | 0 |