| IntentQA: Context-aware Video Intent Reasoning | Jan 1, 2023 | Contrastive Learning, Video Question Answering | Code Available | 1 | 5 |
| Discovering Spatio-Temporal Rationales for Video Question Answering | Jul 22, 2023 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | Sep 30, 2024 | EgoSchema, Language Modelling | Code Available | 1 | 5 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Jan 13, 2025 | Causal Discovery, Causal Inference | Code Available | 1 | 5 |
| Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA | May 13, 2020 | Image Captioning, Multi-Label Classification | Code Available | 1 | 5 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Mar 23, 2023 | Auxiliary Learning, Multimodal Sentiment Analysis | Code Available | 1 | 5 |
| AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering | Nov 25, 2023 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Jun 28, 2025 | Image Segmentation, Large Language Model | Code Available | 1 | 5 |
| DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Jun 1, 2021 | Question Answering, Retrieval | Code Available | 1 | 5 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | May 18, 2021 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | Feb 25, 2024 | Computational Efficiency, Language Modelling | Code Available | 1 | 5 |
| Video Dialog as Conversation about Objects Living in Space-Time | Jul 8, 2022 | Object, Relational Reasoning | Code Available | 1 | 5 |
| Video Graph Transformer for Video Question Answering | Jul 12, 2022 | Question Answering, Relation | Code Available | 1 | 5 |
| Video-Language Alignment via Spatio-Temporal Graph Transformer | Jul 16, 2024 | Contrastive Learning, Question Answering | Code Available | 1 | 5 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Dec 1, 2020 | Question Answering, Question Generation | Code Available | 1 | 5 |
| Hierarchical Conditional Relation Networks for Video Question Answering | Feb 25, 2020 | Audio-Visual Question Answering (AVQA), Question Answering | Code Available | 1 | 5 |
| Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | Dec 12, 2021 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| DAM: Dynamic Adapter Merging for Continual Video QA Learning | Mar 13, 2024 | Continual Learning, image-classification | Code Available | 1 | 5 |
| Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | Jul 17, 2020 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | May 1, 2020 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | Code Available | 1 | 5 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 | 5 |
| On the hidden treasure of dialog in video question answering | Mar 26, 2021 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| HawkEye: Training Video-Text LLMs for Grounding Text in Videos | Mar 15, 2024 | Video Grounding, Video Question Answering | Code Available | 1 | 5 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchema, Question Answering | Code Available | 1 | 5 |
| VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation | Jun 8, 2021 | Multi-Task Learning, Question Answering | Code Available | 1 | 5 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Jun 14, 2022 | Decoder, Language Modeling | Code Available | 1 | 5 |
| AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant | Nov 30, 2021 | Question Answering, Retrieval | Code Available | 1 | 5 |
| Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Oct 8, 2022 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video Grounding, Video Question Answering | Code Available | 1 | 5 |
| Paxion: Patching Action Knowledge in Video-Language Foundation Models | May 18, 2023 | Action Understanding, Diagnostic | Code Available | 1 | 5 |
| Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering | Oct 12, 2024 | Answer Generation, Blocking | Code Available | 1 | 5 |
| Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | Aug 26, 2024 | Form, Language Modelling | Code Available | 1 | 5 |
| Encoding and Controlling Global Semantics for Long-form Video Question Answering | May 30, 2024 | Form, Question Answering | Code Available | 1 | 5 |
| Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | Jan 3, 2024 | Action Detection, Human-Object Interaction Detection | Code Available | 1 | 5 |
| CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions | Dec 8, 2020 | counterfactual, Descriptive | Code Available | 1 | 5 |
| Location-aware Graph Convolutional Networks for Video Question Answering | Aug 7, 2020 | Action Recognition, graph construction | Code Available | 1 | 5 |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | Feb 11, 2021 | Question Answering, Retrieval | Code Available | 1 | 5 |
| VideoCon: Robust Video-Language Alignment via Contrast Captions | Nov 15, 2023 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Visual Commonsense-aware Representation Network for Video Captioning | Nov 17, 2022 | Caption Generation, Question Answering | Code Available | 1 | 5 |
| FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos | Dec 22, 2024 | Language Modelling, Large Language Model | Code Available | 0 | 5 |
| TutorialVQA: Question Answering Dataset for Tutorial Videos | Dec 2, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| TVQA: Localized, Compositional Video Question Answering | Sep 5, 2018 | Video Question Answering | Code Available | 0 | 5 |
| ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos | Nov 2, 2023 | counterfactual, Counterfactual Reasoning | Code Available | 0 | 5 |
| FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework | Apr 9, 2021 | Language Modelling, Multiple-choice | Code Available | 0 | 5 |
| Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval | Jun 5, 2022 | Retrieval, Sentence | Code Available | 0 | 5 |
| TVQA+: Spatio-Temporal Grounding for Video Question Answering | Apr 25, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks | Dec 2, 2024 | Multi-Object Tracking, Object Tracking | Code Available | 0 | 5 |
| Extending Compositional Attention Networks for Social Reasoning in Videos | Oct 3, 2022 | Question Answering, Video Question Answering | Code Available | 0 | 5 |