| CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding | Jul 21, 2021 | Question Answering, Sentence | —Unverified | 0 | 0 |
| Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels | Mar 21, 2024 | Multi-Label Classification | —Unverified | 0 | 0 |
| Read, Look or Listen? What's Needed for Solving a Multimodal Dataset | Jul 6, 2023 | Question Answering, Speaker Identification | —Unverified | 0 | 0 |
| ReasVQA: Advancing VideoQA with Imperfect Reasoning Process | Jan 23, 2025 | Multi-Task Learning, Question Answering | —Unverified | 0 | 0 |
| Recent Advances in Video Question Answering: A Review of Datasets and Methods | Jan 15, 2021 | Information Retrieval, Machine Translation | —Unverified | 0 | 0 |
| Redundancy-aware Transformer for Video Question Answering | Aug 7, 2023 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Video Question Answering with Iterative Video-Text Co-Tokenization | Aug 1, 2022 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models | Apr 18, 2024 | GSM8K, MMLU | —Unverified | 0 | 0 |
| CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising | Dec 14, 2021 | Cross-Modal Retrieval, Decoder | —Unverified | 0 | 0 |
| Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives | Apr 25, 2022 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Retrieval-based Video Language Model for Efficient Long Video Question Answering | Dec 8, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | Jun 15, 2023 | cross-modal alignment, Domain Generalization | —Unverified | 0 | 0 |
| Co-attentional Transformers for Story-Based Video Understanding | Oct 27, 2020 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Video Question Answering with Phrases via Semantic Roles | Apr 8, 2021 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Video Question Generation via Cross-Modal Self-Attention Networks Learning | Jul 5, 2019 | Diversity, Question Answering | —Unverified | 0 | 0 |
| AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | Nov 19, 2024 | GPU, Question Answering | —Unverified | 0 | 0 |
| Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models | Oct 10, 2024 | Conformal Prediction, Language Modeling | —Unverified | 0 | 0 |
| Zero-Shot Long-Form Video Understanding through Screenplay | Jun 25, 2024 | Form, Question Answering | —Unverified | 0 | 0 |
| VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | Dec 9, 2022 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| SEAL: Semantic Attention Learning for Long Video Representation | Dec 2, 2024 | Diversity, Question Answering | —Unverified | 0 | 0 |
| Seed1.5-VL Technical Report | May 11, 2025 | Mixture-of-Experts, Multimodal Reasoning | —Unverified | 0 | 0 |
| Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | Apr 16, 2025 | Hallucination, Question Answering | —Unverified | 0 | 0 |
| Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding | Mar 26, 2025 | GPU, Question Answering | —Unverified | 0 | 0 |
| Self-supervised pre-training and contrastive representation learning for multiple-choice video QA | Sep 17, 2020 | Auxiliary Learning, Contrastive Learning | —Unverified | 0 | 0 |
| Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering | May 14, 2023 | Question Answering, Semantic Role Labeling | —Unverified | 0 | 0 |