| Faces that Speak: Jointly Synthesising Talking Face and Speech from Text | May 16, 2024 | Code Generation, Face Generation | Unverified | 0 |
| Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer | May 15, 2024 | Adversarial Attack, Automatic Speech Recognition | Unverified | 0 |
| PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset | May 14, 2024 | DeepFake Detection, Face Swapping | Code Available | 0 |
| Real-Time Pill Identification for the Visually Impaired Using Deep Learning | May 8, 2024 | Deep Learning, Management | Unverified | 0 |
| Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech | Apr 30, 2024 | Decoder, text-to-speech | Unverified | 0 |
| UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts | Apr 29, 2024 | Contrastive Learning, Speech Synthesis | Code Available | 1 |
| USAT: A Universal Speaker-Adaptive Text-to-Speech Approach | Apr 28, 2024 | Decoder, text-to-speech | Code Available | 1 |
| TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality | Apr 27, 2024 | Imputation, text-to-speech | Unverified | 0 |
| StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations | Apr 23, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Retrieval-Augmented Audio Deepfake Detection | Apr 22, 2024 | Audio Deepfake Detection, DeepFake Detection | Unverified | 0 |
| Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling | Apr 14, 2024 | Polyphone disambiguation, Text Normalization | Unverified | 0 |
| Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network | Apr 11, 2024 | Autonomous Vehicles, text-to-speech | Unverified | 0 |
| CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations | Apr 10, 2024 | Dialogue Generation, text-to-speech | Code Available | 2 |
| Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness | Apr 10, 2024 | Speech Synthesis, text-to-speech | Code Available | 2 |
| The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge | Apr 9, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Cross-Domain Audio Deepfake Detection: Dataset and Analysis | Apr 7, 2024 | Audio Deepfake Detection, DeepFake Detection | Unverified | 0 |
| HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks | Apr 6, 2024 | Domain Adaptation, Speech Synthesis | Code Available | 1 |
| RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis | Apr 4, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech | Apr 3, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders | Apr 3, 2024 | Representation Learning, Speaker Verification | Unverified | 0 |
| KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis | Apr 1, 2024 | Speech Synthesis, text-to-speech | Code Available | 1 |
| CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models | Mar 31, 2024 | Denoising, Speech Synthesis | Code Available | 2 |
| Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation | Mar 31, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| A Review of Multi-Modal Large Language and Vision Models | Mar 28, 2024 | Image Captioning, Prompt Engineering | Unverified | 0 |
| VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild | Mar 25, 2024 | Decoder, Language Modeling | Code Available | 9 |
| Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning | Mar 20, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations | Mar 17, 2024 | Attribute, text-to-speech | Unverified | 0 |
| EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech | Mar 13, 2024 | GPU, Speech Synthesis | Unverified | 0 |
| Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation | Mar 7, 2024 | Diversity, Machine Translation | Unverified | 0 |
| AttentionStitch: How Attention Solves the Speech Editing Problem | Mar 5, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models | Mar 5, 2024 | Quantization, Speech Synthesis | Code Available | 3 |
| Brilla AI: AI Contestant for the National Science and Maths Quiz | Mar 4, 2024 | Math, Question Answering | Code Available | 1 |
| Towards Accurate Lip-to-Speech Synthesis in-the-Wild | Mar 2, 2024 | Language Modelling, Lip to Speech Synthesis | Unverified | 0 |
| Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data | Feb 29, 2024 | Representation Learning, Speech Synthesis | Unverified | 0 |
| An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation | Feb 26, 2024 | Dataset Generation, text-to-speech | Code Available | 2 |
| Efficient data selection employing Semantic Similarity-based Graph Structures for model training | Feb 22, 2024 | Semantic Similarity, Semantic Textual Similarity | Unverified | 0 |
| Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition | Feb 22, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models | Feb 19, 2024 | Denoising, Image Generation | Unverified | 0 |
| Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting | Feb 19, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru | Feb 18, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech | Feb 14, 2024 | Decoder, GPU | Unverified | 0 |
| BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data | Feb 12, 2024 | Decoder, Disentanglement | Unverified | 0 |
| Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like | Feb 12, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| A New Approach to Voice Authenticity | Feb 9, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation | Feb 8, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 2 |
| Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations | Feb 5, 2024 | Decoder, In-Context Learning | Unverified | 0 |
| Natural language guidance of high-fidelity text-to-speech with synthetic annotations | Feb 2, 2024 | In-Context Learning, Language Modeling | Code Available | 9 |
| PAM: Prompting Audio-Language Models for Audio Quality Assessment | Feb 1, 2024 | Audio Quality Assessment, Music Generation | Code Available | 2 |
| Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech | Feb 1, 2024 | text-to-speech, Text to Speech | Unverified | 0 |
| MunTTS: A Text-to-Speech System for Mundari | Jan 28, 2024 | Speech Synthesis, text-to-speech | Unverified | 0 |