| SeamlessM4T: Massively Multilingual & Multimodal Machine Translation | Aug 22, 2023 | Automatic Speech Recognition, Machine Translation | Code Available | 2 |
| Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models | Aug 21, 2023 | text-to-speech, Text to Speech | Unverified | 0 |
| AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis | Aug 16, 2023 | Attribute, Speech Synthesis | Unverified | 0 |
| SpeechX: Neural Codec Language Model as a Versatile Speech Transformer | Aug 14, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation | Aug 12, 2023 | Talking Head Generation, text-to-speech | Code Available | 0 |
| AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining | Aug 10, 2023 | Audio Generation, In-Context Learning | Code Available | 4 |
| Towards an AI to Win Ghana's National Science and Maths Quiz | Aug 8, 2023 | Math, Question Answering | Code Available | 1 |
| Let's Give a Voice to Conversational Agents in Virtual Reality | Aug 4, 2023 | Speech-to-Text, text-to-speech | Code Available | 0 |
| Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation | Aug 3, 2023 | Decoder, Quantization | Code Available | 1 |
| SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis | Aug 2, 2023 | Decoder, Self-Supervised Learning | Unverified | 0 |
| Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings | Jul 31, 2023 | Grapheme-to-Phoneme Conversion, speech-recognition | Unverified | 0 |
| VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design | Jul 31, 2023 | Computational Efficiency, text-to-speech | Code Available | 2 |
| DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training | Jul 31, 2023 | Denoising, Expressive Speech Synthesis | Code Available | 1 |
| Multilingual context-based pronunciation learning for Text-to-Speech | Jul 31, 2023 | text-to-speech, Text to Speech | Unverified | 0 |
| Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech | Jul 31, 2023 | Acoustic Modelling, Speech Synthesis | Unverified | 0 |
| Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation | Jul 30, 2023 | text-to-speech, Text to Speech | Code Available | 1 |
| METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer | Jul 29, 2023 | Disentanglement, Diversity | Unverified | 0 |
| ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus | Jul 29, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 |
| Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding | Jul 28, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer | Jul 20, 2023 | Expressive Speech Synthesis, Language Modelling | Code Available | 1 |
| SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs | Jul 18, 2023 | Generative Adversarial Network, Language Modeling | Unverified | 0 |
| Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis | Jul 14, 2023 | In-Context Learning, Language Modelling | Unverified | 0 |
| Controllable Emphasis with zero data for text-to-speech | Jul 13, 2023 | Sentence, text-to-speech | Unverified | 0 |
| On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis | Jul 11, 2023 | Prediction, Self-Supervised Learning | Unverified | 0 |
| Artificial Eye for the Blind | Jul 7, 2023 | Object, object-detection | Unverified | 0 |
| Text + Sketch: Image Compression at Ultra Low Rates | Jul 4, 2023 | Image Compression, Text to Speech | Code Available | 1 |
| ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading | Jul 3, 2023 | Form, Sentence | Unverified | 0 |
| High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units | Jun 29, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech | Jun 28, 2023 | Emotion Recognition, Speech Synthesis | Code Available | 1 |
| GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech | Jun 27, 2023 | Disentanglement, Style Generalization | Unverified | 0 |
| DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech | Jun 25, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | Jun 23, 2023 | In-Context Learning, Speech Synthesis | Code Available | 0 |
| Visual-Aware Text-to-Speech | Jun 21, 2023 | Rhythm, Speech Synthesis | Unverified | 0 |
| Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer | Jun 20, 2023 | text-to-speech, Text to Speech | Unverified | 0 |
| Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation | Jun 16, 2023 | Data Augmentation, text-to-speech | Unverified | 0 |
| CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages | Jun 16, 2023 | Speech Synthesis, text-to-speech | Unverified | 0 |
| Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects | Jun 14, 2023 | Recommendation Systems, text-to-speech | Code Available | 1 |
| Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation | Jun 14, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models | Jun 13, 2023 | Speech Synthesis, text-to-speech | Code Available | 5 |
| PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling | Jun 13, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech | Jun 9, 2023 | Emotion Recognition, Speech Emotion Recognition | Unverified | 0 |
| VIFS: An End-to-End Variational Inference for Foley Sound Synthesis | Jun 8, 2023 | Speech Synthesis, text-to-speech | Code Available | 0 |
| Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis | Jun 6, 2023 | Neural Rendering, text-to-speech | Unverified | 0 |
| Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias | Jun 6, 2023 | Attribute, Inductive Bias | Unverified | 0 |
| Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model | Jun 5, 2023 | Cross-Lingual Transfer, Language Modeling | Unverified | 0 |
| Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming | Jun 5, 2023 | Bayesian Inference, Singing Voice Synthesis | Code Available | 0 |
| Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis | Jun 5, 2023 | Rhythm, Sentence | Unverified | 0 |
| Towards Robust FastSpeech 2 by Modelling Residual Multimodality | Jun 2, 2023 | Decoder, text-to-speech | Unverified | 0 |
| The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech | Jun 1, 2023 | Cross-Lingual Transfer, text-to-speech | Unverified | 0 |
| XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech | May 31, 2023 | text-to-speech, Text to Speech | Code Available | 5 |