| DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism | May 6, 2021 | Generative Adversarial Network, Singing Voice Synthesis | Code Available | 2 |
| Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram | Oct 25, 2019 | Generative Adversarial Network, GPU | Code Available | 2 |
| FastSpeech: Fast, Robust and Controllable Text-to-Speech | May 22, 2019 | Decoder, text-to-speech | Code Available | 2 |
| FastSpeech: Fast, Robust and Controllable Text to Speech | May 22, 2019 | Decoder, Speech Synthesis | Code Available | 2 |
| LPCNet: Improving Neural Speech Synthesis Through Linear Prediction | Oct 28, 2018 | Prediction, Speech Synthesis | Code Available | 2 |
| Neural Speech Synthesis with Transformer Network | Sep 19, 2018 | Decoder, Machine Translation | Code Available | 2 |
| Efficient Neural Audio Synthesis | Feb 23, 2018 | Audio Synthesis, CPU | Code Available | 2 |
| InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Jun 19, 2025 | Benchmarking, Descriptive | Code Available | 1 |
| GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions | Jun 10, 2025 | text-to-speech, Text to Speech | Code Available | 1 |
| UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information | May 23, 2025 | Large Language Model, Quantization | Code Available | 1 |
| From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition | May 22, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 |
| Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models | May 21, 2025 | Bayesian Optimization, Speech Synthesis | Code Available | 1 |
| ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts | Feb 8, 2025 | Benchmarking, Self-Supervised Learning | Code Available | 1 |
| Developing multilingual speech synthesis system for Ojibwe, Mi'kmaq, and Maliseet | Feb 4, 2025 | Speech Synthesis, text-to-speech | Code Available | 1 |
| MathReader: Text-to-Speech for Mathematical Documents | Jan 13, 2025 | Optical Character Recognition (OCR), text-to-speech | Code Available | 1 |
| Mitigating Unauthorized Speech Synthesis for Voice Protection | Oct 28, 2024 | Data Augmentation, Face Swapping | Code Available | 1 |
| STTATTS: Unified Speech-To-Text And Text-To-Speech Model | Oct 24, 2024 | Multi-Task Learning, speech-recognition | Code Available | 1 |
| Where are we in audio deepfake detection? A systematic analysis over generative and detection models | Oct 6, 2024 | Audio Deepfake Detection, Audio Synthesis | Code Available | 1 |
| LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation | Sep 23, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| PRESENT: Zero-Shot Text-to-Prosody Control | Aug 13, 2024 | Prosody Prediction, Speech Synthesis | Code Available | 1 |
| ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features | Aug 3, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 |
| Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech | Jul 17, 2024 | Speech-to-Speech Translation, text-to-speech | Code Available | 1 |
| E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS | Jun 26, 2024 | text-to-speech, Text to Speech | Code Available | 1 |
| TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers | Jun 22, 2024 | Decoder, Language Modeling | Code Available | 1 |
| AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Jun 11, 2024 | Benchmarking, text-to-speech | Code Available | 1 |
| XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model | Jun 7, 2024 | text-to-speech, Text to Speech | Code Available | 1 |
| UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts | Apr 29, 2024 | Contrastive Learning, Speech Synthesis | Code Available | 1 |
| USAT: A Universal Speaker-Adaptive Text-to-Speech Approach | Apr 28, 2024 | Decoder, text-to-speech | Code Available | 1 |
| HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks | Apr 6, 2024 | Domain Adaptation, Speech Synthesis | Code Available | 1 |
| KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis | Apr 1, 2024 | Speech Synthesis, text-to-speech | Code Available | 1 |
| Brilla AI: AI Contestant for the National Science and Maths Quiz | Mar 4, 2024 | Math, Question Answering | Code Available | 1 |
| Benchmarking Large Multimodal Models against Common Corruptions | Jan 22, 2024 | Benchmarking, Image to text | Code Available | 1 |
| Multi-Task Learning for Front-End Text Processing in TTS | Jan 12, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism | Dec 11, 2023 | Face Generation, Lip Reading | Code Available | 1 |
| Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech | Nov 24, 2023 | Dimensionality Reduction, Emotion Classification | Code Available | 1 |
| Improving fairness for spoken language understanding in atypical speech with Text-to-Speech | Nov 16, 2023 | Data Augmentation, Fairness | Code Available | 1 |
| Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning | Nov 7, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 |
| ArTST: Arabic Text and Speech Transformer | Oct 25, 2023 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 |
| Crowdsourced and Automatic Speech Prominence Estimation | Oct 12, 2023 | Emotion Recognition, text-to-speech | Code Available | 1 |
| Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech | Oct 1, 2023 | speech-recognition, Speech Recognition | Code Available | 1 |
| BiSinger: Bilingual Singing Voice Synthesis | Sep 25, 2023 | Singing Voice Synthesis, text-to-speech | Code Available | 1 |
| Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech | Sep 21, 2023 | text-to-speech, Text to Speech | Code Available | 1 |
| Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model | Sep 20, 2023 | Chatbot, Language Modeling | Code Available | 1 |
| HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods | Sep 15, 2023 | Audio Deepfake Detection, DeepFake Detection | Code Available | 1 |
| Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP | Sep 11, 2023 | text-to-speech, Text to Speech | Code Available | 1 |
| QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning | Aug 31, 2023 | Representation Learning, Speech Representation Learning | Code Available | 1 |
| TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models | Aug 28, 2023 | Language Modelling, text-to-speech | Code Available | 1 |
| Towards an AI to Win Ghana's National Science and Maths Quiz | Aug 8, 2023 | Math, Question Answering | Code Available | 1 |
| Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation | Aug 3, 2023 | Decoder, Quantization | Code Available | 1 |
| DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training | Jul 31, 2023 | Denoising, Expressive Speech Synthesis | Code Available | 1 |