| Building a Luganda Text-to-Speech Model From Crowdsourced Data | May 16, 2024 | Speech Enhancement, text-to-speech | Unverified | 0 |
| Faces that Speak: Jointly Synthesising Talking Face and Speech from Text | May 16, 2024 | Code Generation, Face Generation | Unverified | 0 |
| Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer | May 15, 2024 | Adversarial Attack, Automatic Speech Recognition | Unverified | 0 |
| PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset | May 14, 2024 | DeepFake Detection, Face Swapping | Code Available | 0 |
| Real-Time Pill Identification for the Visually Impaired Using Deep Learning | May 8, 2024 | Deep Learning, Management | Unverified | 0 |
| Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech | Apr 30, 2024 | Decoder, text-to-speech | Unverified | 0 |
| TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality | Apr 27, 2024 | Imputation, text-to-speech | Unverified | 0 |
| StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations | Apr 23, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Retrieval-Augmented Audio Deepfake Detection | Apr 22, 2024 | Audio Deepfake Detection, DeepFake Detection | Unverified | 0 |
| Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling | Apr 14, 2024 | Polyphone disambiguation, Text Normalization | Unverified | 0 |
| Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network | Apr 11, 2024 | Autonomous Vehicles, text-to-speech | Unverified | 0 |
| The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge | Apr 9, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Cross-Domain Audio Deepfake Detection: Dataset and Analysis | Apr 7, 2024 | Audio Deepfake Detection, DeepFake Detection | Unverified | 0 |
| RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis | Apr 4, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech | Apr 3, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders | Apr 3, 2024 | Representation Learning, Speaker Verification | Unverified | 0 |
| Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation | Mar 31, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| A Review of Multi-Modal Large Language and Vision Models | Mar 28, 2024 | Image Captioning, Prompt Engineering | Unverified | 0 |
| Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning | Mar 20, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations | Mar 17, 2024 | Attribute, text-to-speech | Unverified | 0 |
| EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech | Mar 13, 2024 | GPU, Speech Synthesis | Unverified | 0 |
| Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation | Mar 7, 2024 | Diversity, Machine Translation | Unverified | 0 |
| AttentionStitch: How Attention Solves the Speech Editing Problem | Mar 5, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Towards Accurate Lip-to-Speech Synthesis in-the-Wild | Mar 2, 2024 | Language Modelling, Lip to Speech Synthesis | Unverified | 0 |
| Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data | Feb 29, 2024 | Representation Learning, Speech Synthesis | Unverified | 0 |
| Efficient data selection employing Semantic Similarity-based Graph Structures for model training | Feb 22, 2024 | Semantic Similarity, Semantic Textual Similarity | Unverified | 0 |
| Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition | Feb 22, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models | Feb 19, 2024 | Denoising, Image Generation | Unverified | 0 |
| Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting | Feb 19, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru | Feb 18, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech | Feb 14, 2024 | Decoder, GPU | Unverified | 0 |
| Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like | Feb 12, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data | Feb 12, 2024 | Decoder, Disentanglement | Unverified | 0 |
| A New Approach to Voice Authenticity | Feb 9, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations | Feb 5, 2024 | Decoder, In-Context Learning | Unverified | 0 |
| Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech | Feb 1, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| MunTTS: A Text-to-Speech System for Mundari | Jan 28, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |
| VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech | Jan 25, 2024 | Decoder, Hallucination | Unverified | 0 |
| Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization | Jan 23, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Adversarial speech for voice privacy protection from Personalized Speech generation | Jan 22, 2024 | Speaker Verification, text-to-speech | Unverified | 0 |
| Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis | Jan 22, 2024 | Speaker Verification, Speech Synthesis | Unverified | 0 |
| Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech | Jan 19, 2024 | Self-Supervised Learning, text-to-speech | Unverified | 0 |
| MCMChaos: Improvising Rap Music with MCMC Methods and Chaos Theory | Jan 15, 2024 | Music Generation, text-to-speech | Unverified | 0 |
| ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering | Jan 14, 2024 | Audio Generation, Language Modeling | Unverified | 0 |
| End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2 | Jan 11, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters | Jan 10, 2024 | Self-Supervised Learning, Speech Enhancement | Unverified | 0 |
| Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments | Jan 7, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Transfer the linguistic representations from TTS to accent conversion with non-parallel data | Jan 7, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction | Jan 3, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Incremental FastPitch: Chunk-based High Quality Text to Speech | Jan 3, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |