| Reading Between the Lanes: Text VideoQA on the Road | Jul 8, 2023 | Question AnsweringScene Text Recognition | CodeCode Available | 0 |
| VidCtx: Context-aware Video Question Answering with Image Models | Dec 23, 2024 | Large Language ModelQuestion Answering | CodeCode Available | 0 |
| Long Story Short: a Summarize-then-Search Method for Long Video Question Answering | Nov 2, 2023 | DiversityQuestion Answering | CodeCode Available | 0 |
| End-to-End Video Question-Answer Generation with Generator-Pretester Network | Jan 5, 2021 | Answer GenerationQuestion-Answer-Generation | CodeCode Available | 0 |
| STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering | Jan 8, 2024 | Question AnsweringVideo Question Answering | CodeCode Available | 0 |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | Jun 3, 2025 | EgoSchemaQuestion Answering | CodeCode Available | 0 |
| OmniNet: A unified architecture for multi-modal multi-task learning | Jul 17, 2019 | Image CaptioningMulti-Task Learning | CodeCode Available | 0 |
| Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning | Jun 9, 2025 | Future predictionQuestion Answering | CodeCode Available | 0 |
| ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos | May 4, 2023 | Question AnsweringSpatio-temporal Scene Graphs | CodeCode Available | 0 |
| Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Feb 4, 2023 | Computational EfficiencyQuestion Answering | CodeCode Available | 0 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | Mar 29, 2023 | Cross-Modal RetrievalDecoder | CodeCode Available | 0 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | Apr 11, 2024 | DescriptiveHallucination | CodeCode Available | 0 |
| Visual Choice of Plausible Alternatives: An Evaluation of Image-based Commonsense Causal Reasoning | May 1, 2018 | Commonsense Causal ReasoningImage Captioning | CodeCode Available | 0 |
| CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering | Nov 7, 2022 | Add - POAdd - PQ | CodeCode Available | 0 |
| On Modality Bias in the TVQA Dataset | Dec 18, 2020 | Question AnsweringVideo Question Answering | CodeCode Available | 0 |
| MemexQA: Visual Memex Question Answering | Aug 4, 2017 | Memex Question AnsweringQuestion Answering | CodeCode Available | 0 |
| MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning | Jul 4, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 0 |
| A Better Way to Attend: Attention with Trees for Video Question Answering | Sep 5, 2019 | Question AnsweringVideo Question Answering | CodeCode Available | 0 |
| DramaQA: Character-Centered Video Story Understanding with Hierarchical QA | May 7, 2020 | Question AnsweringVideo Question Answering | CodeCode Available | 0 |
| Relation-aware Hierarchical Attention Framework for Video Question Answering | May 13, 2021 | Question AnsweringRelation | CodeCode Available | 0 |
| Exploring Models and Data for Image Question Answering | May 8, 2015 | Image Segmentationobject-detection | CodeCode Available | 0 |
| Open-Ended Multi-Modal Relational Reasoning for Video Question Answering | Dec 1, 2020 | Question AnsweringRelational Reasoning | CodeCode Available | 0 |
| Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks | Dec 2, 2024 | Multi-Object TrackingObject Tracking | CodeCode Available | 0 |
| ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | May 21, 2025 | Decision MakingLanguage Modeling | CodeCode Available | 0 |
| ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos | Nov 2, 2023 | counterfactualCounterfactual Reasoning | CodeCode Available | 0 |