| A Better Way to Attend: Attention with Trees for Video Question Answering | Sep 5, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| Long Story Short: a Summarize-then-Search Method for Long Video Question Answering | Nov 2, 2023 | Diversity, Question Answering | Code Available | 0 | 5 |
| Exploring Models and Data for Image Question Answering | May 8, 2015 | Image Segmentation, object-detection | Code Available | 0 | 5 |
| Verbs in Action: Improving verb understanding in video-language models | Apr 13, 2023 | Contrastive Learning, Question Answering | Code Available | 0 | 5 |
| VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos | May 29, 2025 | Question Answering, Video Generation | Code Available | 0 | 5 |
| Vamos: Versatile Action Models for Video Understanding | Nov 22, 2023 | EgoSchema, Hard Attention | Code Available | 0 | 5 |
| VidCtx: Context-aware Video Question Answering with Image Models | Dec 23, 2024 | Large Language Model, Question Answering | Code Available | 0 | 5 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 | 5 |
| TVQA+: Spatio-Temporal Grounding for Video Question Answering | Apr 25, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Mar 28, 2023 | Action Classification, Action Recognition | Code Available | 0 | 5 |
| TutorialVQA: Question Answering Dataset for Tutorial Videos | Dec 2, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| TVQA: Localized, Compositional Video Question Answering | Sep 5, 2018 | Video Question Answering | Code Available | 0 | 5 |
| Listen Then See: Video Alignment with Speaker Attention | Apr 21, 2024 | cross-modal alignment, Question Answering | Code Available | 0 | 5 |
| Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval | Jun 5, 2022 | Retrieval, Sentence | Code Available | 0 | 5 |
| Lightweight Recurrent Cross-modal Encoder for Video Question Answering | Jun 30, 2023 | Action Recognition, Question Answering | Code Available | 0 | 5 |
| Enhancing Temporal Modeling of Video LLMs via Time Gating | Oct 8, 2024 | MVBench, Question Answering | Code Available | 0 | 5 |
| ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos | May 4, 2023 | Question Answering, Spatio-temporal Scene Graphs | Code Available | 0 | 5 |
| End-to-End Video Question-Answer Generation with Generator-Pretester Network | Jan 5, 2021 | Answer Generation, Question-Answer-Generation | Code Available | 0 | 5 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | Apr 11, 2024 | Descriptive, Hallucination | Code Available | 0 | 5 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | Jun 6, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering | Jan 8, 2024 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | Jun 3, 2025 | EgoSchema, Question Answering | Code Available | 0 | 5 |
| Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Feb 4, 2023 | Computational Efficiency, Question Answering | Code Available | 0 | 5 |
| Relation-aware Hierarchical Attention Framework for Video Question Answering | May 13, 2021 | Question Answering, Relation | Code Available | 0 | 5 |
| Reading Between the Lanes: Text VideoQA on the Road | Jul 8, 2023 | Question Answering, Scene Text Recognition | Code Available | 0 | 5 |