| TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer | Jan 10, 2025 | Speech Recognition | Unverified | 0 |
| Probing Speaker-specific Features in Speaker Representations | Jan 9, 2025 | Self-Supervised Learning, Speaker Verification | Unverified | 0 |
| Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model | Jan 8, 2025 | Text to Speech | Unverified | 0 |
| FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles | Jan 2, 2025 | Speech Synthesis, Text to Speech | Unverified | 0 |
| Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT | Jan 2, 2025 | Polyphone disambiguation, Sentence | Unverified | 0 |
| Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting | Dec 28, 2024 | Speech Synthesis, Text to Speech | Unverified | 0 |
| Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID | Dec 26, 2024 | Language Identification, Text to Speech | Unverified | 0 |
| "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities | Dec 26, 2024 | Domain Adaptation, Language Modeling | Code Available | 0 |
| Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset | Dec 25, 2024 | Text to Speech | Unverified | 0 |
| Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis | Dec 22, 2024 | Decoder, Disentanglement | Unverified | 0 |
| Autoregressive Speech Synthesis with Next-Distribution Prediction | Dec 22, 2024 | Language Modeling | Unverified | 0 |
| Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective | Dec 22, 2024 | Text to Speech | Unverified | 0 |
| Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers | Dec 20, 2024 | Language Modeling | Unverified | 0 |
| Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling | Dec 19, 2024 | Attribute, Speech Enhancement | Unverified | 0 |
| Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes | Dec 17, 2024 | DeepFake Detection, Face Swapping | Unverified | 0 |
| Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion | Dec 17, 2024 | Text to Speech | Unverified | 0 |
| ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis | Dec 16, 2024 | Speech Synthesis, Text to Speech | Unverified | 0 |
| Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech | Dec 16, 2024 | Text to Speech | Code Available | 0 |
| Efficient Generative Modeling with Residual Vector Quantization-Based Tokens | Dec 13, 2024 | Conditional Image Generation, Image Generation | Unverified | 0 |
| AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation | Dec 13, 2024 | Data Augmentation, Sarcasm Detection | Unverified | 0 |
| CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder | Dec 12, 2024 | Audio Synthesis, Singing Voice Synthesis | Unverified | 0 |
| A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings | Dec 11, 2024 | Text to Speech | Unverified | 0 |
| Multimodal Latent Language Modeling with Next-Token Diffusion | Dec 11, 2024 | Image Generation, Language Modeling | Code Available | 0 |
| A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction | Dec 11, 2024 | Decoder, Self-Supervised Learning | Unverified | 0 |
| LatentSpeech: Latent Diffusion for Text-To-Speech Generation | Dec 11, 2024 | Text to Speech | Unverified | 0 |
| Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration | Dec 11, 2024 | Text to Speech | Unverified | 0 |
| EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations | Dec 9, 2024 | Text to Speech | Unverified | 0 |
| DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles | Dec 4, 2024 | Prosody Prediction, Text to Speech | Unverified | 0 |
| Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor | Dec 1, 2024 | All, Natural Language Understanding | Unverified | 0 |
| SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation | Nov 27, 2024 | Question Answering, Speech Enhancement | Unverified | 0 |
| Continual Learning in Machine Speech Chain Using Gradient Episodic Memory | Nov 27, 2024 | Automatic Speech Recognition (ASR) | Unverified | 0 |
| Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis | Nov 26, 2024 | Decoder, multimodal generation | Unverified | 0 |
| Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM | Nov 20, 2024 | Automatic Speech Recognition (ASR) | Unverified | 0 |
| A Context-Based Numerical Format Prediction for a Text-To-Speech System | Nov 19, 2024 | Text to Speech | Unverified | 0 |
| Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D | Nov 19, 2024 | Speech-to-Text, Text to Speech | Unverified | 0 |
| Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation | Nov 19, 2024 | Text to Speech | Unverified | 0 |
| Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models | Nov 12, 2024 | Grapheme-to-Phoneme Conversion, Retrieval | Unverified | 0 |
| Debatts: Zero-Shot Debating Text-to-Speech Synthesis | Nov 10, 2024 | Speech Synthesis, Text to Speech | Unverified | 0 |
| CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR | Nov 7, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? | Oct 31, 2024 | Rhythm, Speech Recognition | Unverified | 0 |
| Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech | Oct 29, 2024 | Decoder, Text to Speech | Code Available | 0 |
| Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding | Oct 29, 2024 | Speech Synthesis, Text to Speech | Unverified | 0 |
| RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis | Oct 29, 2024 | Denoising, Singing Voice Synthesis | Unverified | 0 |
| Asynchronous Tool Usage for Real-Time Agents | Oct 28, 2024 | Automatic Speech Recognition, Speech Recognition | Unverified | 0 |
| Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation | Oct 27, 2024 | parameter-efficient fine-tuning, Question Answering | Unverified | 0 |
| Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis | Oct 24, 2024 | Speech Synthesis, Text to Speech | Unverified | 0 |
| Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts | Oct 24, 2024 | Automatic Speech Recognition (ASR) | Unverified | 0 |
| ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams | Oct 23, 2024 | Automatic Speech Recognition (ASR) | Unverified | 0 |
| Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap | Oct 22, 2024 | Automatic Speech Recognition (ASR) | Unverified | 0 |
| Continuous Speech Tokenizer in Text To Speech | Oct 22, 2024 | Language Modeling | Code Available | 0 |