| Equivariant and Invariant Grounding for Video Question Answering | Jul 26, 2022 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| LifeQA: A Real-life Dataset for Video Question Answering | May 1, 2020 | Multiple-choice, Question Answering | Code Available | 1 | 5 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering | Code Available | 1 | 5 |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | Feb 11, 2021 | Question Answering, Retrieval | Code Available | 1 | 5 |
| An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Sep 4, 2022 | Fill Mask, Optical Flow Estimation | Code Available | 1 | 5 |
| BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Sep 27, 2023 | GPU, Video-based Generative Performance Benchmarking | Code Available | 1 | 5 |
| On the hidden treasure of dialog in video question answering | Mar 26, 2021 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| RTQ: Rethinking Video-language Understanding Based on Image-text Model | Dec 1, 2023 | Video Captioning, Video Question Answering | Code Available | 1 | 5 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | May 29, 2023 | Chart Question Answering, document understanding | Code Available | 1 | 5 |
| Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | May 22, 2022 | Attribute, Automatic Speech Recognition | Code Available | 1 | 5 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | May 18, 2021 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| Encoding and Controlling Global Semantics for Long-form Video Question Answering | May 30, 2024 | Form, Question Answering | Code Available | 1 | 5 |
| Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting | Oct 1, 2024 | Continual Learning, Language Modeling | Code Available | 1 | 5 |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | Oct 24, 2023 | Natural Language Understanding, Question Answering | Code Available | 1 | 5 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Jun 19, 2021 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes | Apr 1, 2024 | Causal Discovery, Causal Discovery in Video Reasoning | Code Available | 1 | 5 |
| Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | Feb 25, 2024 | Computational Efficiency, Language Modelling | Code Available | 1 | 5 |
| Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | Jul 17, 2020 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos | Mar 28, 2025 | Benchmarking, Question Answering | Code Available | 1 | 5 |
| EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | Feb 11, 2025 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | Decoder, Language Modeling | Code Available | 1 | 5 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 | 5 |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | May 30, 2022 | counterfactual, Descriptive | Code Available | 1 | 5 |
| Can I Trust Your Answer? Visually Grounded Video Question Answering | Sep 4, 2023 | Grounded Video Question Answering, Question Answering | Code Available | 1 | 5 |