| End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling | Jul 21, 2024 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| LongVILA: Scaling Long-Context Visual Language Models for Long Videos | Aug 19, 2024 | Video Captioning, Video Question Answering | Unverified | 0 | 0 |
| Efficient Motion-Aware Video MLLM | Jan 1, 2025 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| VUDG: A Dataset for Video Understanding Domain Generalization | May 30, 2025 | Domain Generalization, Multiple-choice | Unverified | 0 | 0 |
| MarioQA: Answering Questions by Watching Gameplay Videos | Dec 6, 2016 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Measuring Compositional Consistency for Video Question Answering | Apr 14, 2022 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation | May 4, 2023 | Decoder, Question Answering | Unverified | 0 | 0 |
| VideoOrion: Tokenizing Object Dynamics in Videos | Nov 25, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering | Jul 1, 2022 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | Nov 9, 2023 | Action Classification, Audio Classification | Unverified | 0 | 0 |
| AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | Jan 1, 2025 | GPU, Question Answering | Unverified | 0 | 0 |
| M-LLM Based Video Frame Selection for Efficient Video Understanding | Feb 27, 2025 | EgoSchema, Language Modeling | Unverified | 0 | 0 |
| MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering | Oct 6, 2023 | counterfactual, Question Answering | Unverified | 0 | 0 |
| Modality Alignment between Deep Representations for Effective Video-and-Language Learning | Jun 1, 2022 | Question Answering, Video Captioning | Unverified | 0 | 0 |
| Modality Shifting Attention Network for Multi-modal Video Question Answering | Jul 4, 2020 | Question Answering, Temporal Localization | Unverified | 0 | 0 |
| Modeling Semantic Composition with Syntactic Hypergraph for Video Question Answering | May 13, 2022 | Question Answering, Semantic Composition | Unverified | 0 | 0 |
| Modular Blended Attention Network for Video Question Answering | Nov 2, 2023 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | Apr 9, 2024 | EgoSchema, Multiple-choice | Unverified | 0 | 0 |
| Motion-Appearance Co-Memory Networks for Video Question Answering | Mar 29, 2018 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering | Aug 11, 2021 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| Diversifying Joint Vision-Language Tokenization Learning | Jun 6, 2023 | Question Answering, Representation Learning | Unverified | 0 | 0 |
| Distraction-free Embeddings for Robust VQA | Aug 31, 2023 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents | Apr 25, 2018 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | Dec 8, 2023 | Form, Question Answering | Unverified | 0 | 0 |
| Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering | Jan 1, 2023 | Question Answering, Video Question Answering | Unverified | 0 | 0 |