| mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition | Feb 3, 2025 | Audio-Visual Speech RecognitionDecoder | CodeCode Available | 3 | 5 |
| Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | Jun 14, 2024 | Audio-Visual Speech RecognitionAutomatic Speech Recognition (ASR) | CodeCode Available | 3 | 5 |
| CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization | May 6, 2025 | Active Speaker DetectionAudio-Visual Speech Recognition | CodeCode Available | 2 | 5 |
| Large Language Models are Strong Audio-Visual Speech Recognition Learners | Sep 18, 2024 | Audio-Visual Speech RecognitionAutomatic Speech Recognition | CodeCode Available | 2 | 5 |
| MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation | Mar 1, 2023 | Audio-Visual Speech RecognitionRobust Speech Recognition | CodeCode Available | 2 | 5 |
| Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | Mar 25, 2023 | Audio-Visual Speech RecognitionAutomatic Speech Recognition | CodeCode Available | 2 | 5 |
| Robust Self-Supervised Audio-Visual Speech Recognition | Jan 5, 2022 | Audio-Visual Speech RecognitionAutomatic Speech Recognition | CodeCode Available | 2 | 5 |
| Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | Jan 23, 2025 | Audio-Visual Speech RecognitionMulti-Task Learning | CodeCode Available | 1 | 5 |
| Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition | Jul 4, 2024 | Audio-Visual Speech Recognitionspeech-recognition | CodeCode Available | 1 | 5 |
| Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | Feb 24, 2022 | Audio-Visual Speech RecognitionAutomatic Speech Recognition (ASR) | CodeCode Available | 1 | 5 |
| Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition | May 16, 2023 | Audio-Visual Speech RecognitionAutomatic Speech Recognition | CodeCode Available | 1 | 5 |
| Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder | Aug 14, 2023 | Audio-Visual Speech RecognitionAutomatic Speech Recognition | CodeCode Available | 1 | 5 |
| Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | Feb 9, 2025 | Audio-Visual Speech RecognitionAutomatic Speech Recognition | CodeCode Available | 1 | 5 |
| MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information | Jun 4, 2023 | Audio-Visual Speech Recognitionspeech-recognition | CodeCode Available | 1 | 5 |
| It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | Feb 8, 2024 | Audio-Visual Speech RecognitionAutomatic Speech Recognition | CodeCode Available | 1 | 5 |
| MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens | Mar 14, 2025 | Audio-Visual Speech RecognitionComputational Efficiency | CodeCode Available | 1 | 5 |
| Discriminative Multi-modality Speech Recognition | May 12, 2020 | Audio-Visual Speech RecognitionLipreading | CodeCode Available | 1 | 5 |
| End-to-end Audio-visual Speech Recognition with Conformers | Feb 12, 2021 | Audio-Visual Speech RecognitionAutomatic Speech Recognition (ASR) | CodeCode Available | 1 | 5 |
| CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition | Jan 11, 2022 | Audio-Visual Speech Recognitionspeech-recognition | CodeCode Available | 1 | 5 |
| CI-AVSR: A Cantonese Audio-Visual Speech Datasetfor In-car Command Recognition | Jun 1, 2022 | Audio-Visual Speech Recognitionspeech-recognition | CodeCode Available | 1 | 5 |
| How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition | Apr 17, 2020 | Audio-Visual Speech Recognitionspeech-recognition | CodeCode Available | 1 | 5 |
| Jointly Learning Visual and Auditory Speech Representations from Raw Data | Dec 12, 2022 | Audio-Visual Speech RecognitionLipreading | CodeCode Available | 1 | 5 |
| AV Taris: Online Audio-Visual Speech Recognition | Dec 14, 2020 | Action DetectionActivity Detection | CodeCode Available | 1 | 5 |
| Deep Audio-Visual Speech Recognition | Sep 6, 2018 | Audio-Visual Speech RecognitionAutomatic Speech Recognition (ASR) | CodeCode Available | 1 | 5 |
| Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition | Jun 18, 2023 | Audio-Visual Speech Recognitionspeech-recognition | CodeCode Available | 1 | 5 |