SOTAVerified

Automatic Speech Recognition

Papers

Showing 150 of 3174 papers

Title | Status | Hype
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training | Code | 11
GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | Code | 7
Scaling Speech-Text Pre-training with Synthetic Interleaved Data | Code | 7
OxfordVGG Submission to the EGO4D AV Transcription Challenge | Code | 6
FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration | Code | 5
GigaAM: Efficient Self-Supervised Learner for Speech Recognition | Code | 4
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | Code | 4
Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages | Code | 4
SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation | Code | 4
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play | Code | 3
VoiceBench: Benchmarking LLM-Based Voice Assistants | Code | 3
WhisperNER: Unified Open Named Entity and Speech Recognition | Code | 3
MooER: LLM-based Speech Recognition and Translation Models from Moore Threads | Code | 3
Sentiment Reasoning for Healthcare | Code | 3
Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models | Code | 3
PhoWhisper: Automatic Speech Recognition for Vietnamese | Code | 3
DiarizationLM: Speaker Diarization Post-Processing with Large Language Models | Code | 3
SALMONN: Towards Generic Hearing Abilities for Large Language Models | Code | 3
Delay-penalized transducer for low-latency streaming ASR | Code | 3
Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates | Code | 3
A Parallelizable Lattice Rescoring Strategy with Neural Language Models | Code | 3
Conformer: Convolution-augmented Transformer for Speech Recognition | Code | 3
TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation | Code | 3
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement | Code | 2
CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR | Code | 2
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation | Code | 2
DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition | Code | 2
Streaming Keyword Spotting Boosted by Cross-layer Discrimination Consistency | Code | 2
Dialectal Coverage And Generalization in Arabic Speech Recognition | Code | 2
emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography | Code | 2
MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages | Code | 2
Recent Advances in Speech Language Models: A Survey | Code | 2
Large Language Models are Strong Audio-Visual Speech Recognition Learners | Code | 2
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions | Code | 2
wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech | Code | 2
Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units | Code | 2
Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition | Code | 2
PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings | Code | 2
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | Code | 2
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Code | 2
Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation | Code | 2
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition | Code | 2
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition | Code | 2
LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | Code | 2
LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models | Code | 2
FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec | Code | 2
PromptASR for contextualized ASR with controllable style | Code | 2
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation | Code | 2
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | Code | 2
Stabilizing Transformer Training by Preventing Attention Entropy Collapse | Code | 2

No leaderboard results yet.