CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens | Jul 7, 2024 | Language Modelling Large Language Model | Code Available 115
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs | Jul 4, 2024 | Emotion Recognition Event Detection | Code Available 115
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training | May 23, 2025 | Automatic Speech Recognition Emotion Recognition | Code Available 115
Moonshine: Speech Recognition for Live Transcription and Voice Commands | Oct 21, 2024 | Decoder Position | Code Available 95
Moshi: a speech-text foundation model for real-time dialogue | Sep 17, 2024 | Action Detection Activity Detection | Code Available 95
Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition | Jul 17, 2023 | Decoder Language Modeling | Code Available 85
Robust Speech Recognition via Large-Scale Weak Supervision | Dec 6, 2022 | Robust Speech Recognition speech-recognition | Code Available 85
GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | Dec 3, 2024 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 75
Scaling Speech-Text Pre-training with Synthetic Interleaved Data | Nov 26, 2024 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 75
Qwen2.5-Omni Technical Report | Mar 26, 2025 | Automatic Speech Recognition (ASR) GSM8K | Code Available 75
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant | Oct 20, 2024 | Question Answering speech-recognition | Code Available 75
Kimi-Audio Technical Report | Apr 25, 2025 | Audio Question Answering Question Answering | Code Available 75
Speechless: Speech Instruction Training Without Speech for Low Resource Languages | May 23, 2025 | speech-recognition Speech Recognition | Code Available 75
PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit | May 20, 2022 | All Automatic Speech Recognition (ASR) | Code Available 65
OxfordVGG Submission to the EGO4D AV Transcription Challenge | Jul 18, 2023 | Automatic Speech Recognition speech-recognition | Code Available 65
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning | Jun 5, 2024 | Automatic Speech Recognition (ASR) de-en | Code Available 55
FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration | Jan 24, 2025 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 55
WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit | Mar 29, 2022 | Decoder Language Modelling | Code Available 55
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | May 6, 2025 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 45
The Llama 3 Herd of Models | Jul 31, 2024 | answerability prediction Language Modeling | Code Available 45
Turning Whisper into Real-Time Transcription System | Jul 27, 2023 | speech-recognition Speech Recognition | Code Available 45
GigaAM: Efficient Self-Supervised Learner for Speech Recognition | Jun 1, 2025 | Automatic Speech Recognition Language Modeling | Code Available 45
Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System | May 17, 2024 | Data Augmentation Speech Dereverberation | Code Available 45
Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages | Mar 26, 2025 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 45
CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions | Aug 29, 2024 | Dynamic Time Warping speech-recognition | Code Available 45
SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation | Mar 13, 2024 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 45
Multi-head Temporal Latent Attention | May 19, 2025 | GPU speech-recognition | Code Available 45
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch | Oct 27, 2023 | Self-Supervised Learning Speech Enhancement | Code Available 45
A Survey on Vision-Language-Action Models for Embodied AI | May 23, 2024 | Image Captioning Instruction Following | Code Available 45
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling | Nov 1, 2023 | Hallucination Knowledge Distillation | Code Available 45
VoiceBench: Benchmarking LLM-Based Voice Assistants | Oct 22, 2024 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 35
GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement | Jun 17, 2024 | speech-recognition Speech Recognition | Code Available 35
A Parallelizable Lattice Rescoring Strategy with Neural Language Models | Mar 8, 2021 | ARC Automatic Speech Recognition | Code Available 35
Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates | Sep 27, 2021 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 35
TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation | May 12, 2018 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 35
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play | May 5, 2025 | AI Agent Automatic Speech Recognition | Code Available 35
Semi-Supervised Speech Recognition via Local Prior Matching | Feb 24, 2020 | Knowledge Distillation Language Modeling | Code Available 35
DiarizationLM: Speaker Diarization Post-Processing with Large Language Models | Jan 7, 2024 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 35
Delay-penalized transducer for low-latency streaming ASR | Oct 31, 2022 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 35
Datasets: A Community Library for Natural Language Processing | Sep 7, 2021 | Image Classification Object Recognition | Code Available 35
SALMONN: Towards Generic Hearing Abilities for Large Language Models | Oct 20, 2023 | Audio captioning Automatic Speech Recognition | Code Available 35
Sentiment Reasoning for Healthcare | Jul 24, 2024 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 35
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | May 18, 2023 | Language Modeling Language Modelling | Code Available 35
PhoWhisper: Automatic Speech Recognition for Vietnamese | Mar 27, 2024 | Automatic Speech Recognition speech-recognition | Code Available 35
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia | Jan 23, 2025 | Emotion Recognition Event Detection | Code Available 35
Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models | May 23, 2024 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 35
Conformer: Convolution-augmented Transformer for Speech Recognition | May 16, 2020 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 35
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Nov 14, 2023 | Acoustic Scene Classification Audio captioning | Code Available 35
MooER: LLM-based Speech Recognition and Translation Models from Moore Threads | Aug 9, 2024 | Automatic Speech Recognition Automatic Speech Recognition (ASR) | Code Available 35
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition | Feb 3, 2025 | Audio-Visual Speech Recognition Decoder | Code Available 35