| Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | Apr 8, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| Hierarchical Banzhaf Interaction for General Video-Language Representation Learning | Dec 30, 2024 | Contrastive Learning, Question Answering | Code Available | 0 | 5 |
| ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos | May 4, 2023 | Question Answering, Spatio-temporal Scene Graphs | Code Available | 0 | 5 |
| Enhancing Temporal Modeling of Video LLMs via Time Gating | Oct 8, 2024 | MVBench, Question Answering | Code Available | 0 | 5 |
| OmniNet: A unified architecture for multi-modal multi-task learning | Jul 17, 2019 | Image Captioning, Multi-Task Learning | Code Available | 0 | 5 |
| ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos | Nov 2, 2023 | counterfactual, Counterfactual Reasoning | Code Available | 0 | 5 |
| TutorialVQA: Question Answering Dataset for Tutorial Videos | Dec 2, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| TVQA: Localized, Compositional Video Question Answering | Sep 5, 2018 | Video Question Answering | Code Available | 0 | 5 |
| TVQA+: Spatio-Temporal Grounding for Video Question Answering | Apr 25, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning | Jun 9, 2025 | Future prediction, Question Answering | Code Available | 0 | 5 |
| Listen Then See: Video Alignment with Speaker Attention | Apr 21, 2024 | cross-modal alignment, Question Answering | Code Available | 0 | 5 |
| ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | May 21, 2025 | Decision Making, Language Modeling | Code Available | 0 | 5 |
| Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | Oct 7, 2023 | Action Recognition, Multiple-choice | Code Available | 0 | 5 |
| Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks | Dec 2, 2024 | Multi-Object Tracking, Object Tracking | Code Available | 0 | 5 |
| YTCommentQA: Video Question Answerability in Instructional Videos | Jan 30, 2024 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | Articles, Image Comprehension | Code Available | 0 | 5 |
| Extending Compositional Attention Networks for Social Reasoning in Videos | Oct 3, 2022 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| Exploring Temporal Concurrency for Video-Language Representation Learning | Jan 1, 2023 | Dynamic Time Warping, Metric Learning | Code Available | 0 | 5 |
| Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer | Feb 4, 2023 | Computational Efficiency, Question Answering | Code Available | 0 | 5 |
| CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering | Nov 7, 2022 | Add - PO, Add - PQ | Code Available | 0 | 5 |
| Unmasked Teacher: Towards Training-Efficient Video Foundation Models | Mar 28, 2023 | Action Classification, Action Recognition | Code Available | 0 | 5 |
| Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering | Jun 19, 2021 | AI Agent, Question Answering | Code Available | 0 | 5 |
| On Modality Bias in the TVQA Dataset | Dec 18, 2020 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| Vamos: Versatile Action Models for Video Understanding | Nov 22, 2023 | EgoSchema, Hard Attention | Code Available | 0 | 5 |
| Exploring Models and Data for Image Question Answering | May 8, 2015 | Image Segmentation, object-detection | Code Available | 0 | 5 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | Code Available | 0 | 5 |
| Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering | May 9, 2022 | multimodal interaction, Question Answering | Code Available | 0 | 5 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | Mar 29, 2023 | Cross-Modal Retrieval, Decoder | Code Available | 0 | 5 |
| Lightweight Recurrent Cross-modal Encoder for Video Question Answering | Jun 30, 2023 | Action Recognition, Question Answering | Code Available | 0 | 5 |
| Verbs in Action: Improving verb understanding in video-language models | Apr 13, 2023 | Contrastive Learning, Question Answering | Code Available | 0 | 5 |
| VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos | May 29, 2025 | Question Answering, Video Generation | Code Available | 0 | 5 |
| VidCtx: Context-aware Video Question Answering with Image Models | Dec 23, 2024 | Large Language Model, Question Answering | Code Available | 0 | 5 |
| Open-Ended Multi-Modal Relational Reasoning for Video Question Answering | Dec 1, 2020 | Question Answering, Relational Reasoning | Code Available | 0 | 5 |
| Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Feb 16, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | Jun 6, 2019 | Question Answering, Video Question Answering | Code Available | 0 | 5 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 | 5 |
| Long Story Short: a Summarize-then-Search Method for Long Video Question Answering | Nov 2, 2023 | Diversity, Question Answering | Code Available | 0 | 5 |
| Reading Between the Lanes: Text VideoQA on the Road | Jul 8, 2023 | Question Answering, Scene Text Recognition | Code Available | 0 | 5 |
| A Joint Sequence Fusion Model for Video Question Answering and Retrieval | Aug 7, 2018 | Decoder, Multiple-choice | Code Available | 0 | 5 |
| LAVIS: A Library for Language-Vision Intelligence | Sep 15, 2022 | Benchmarking, Image Captioning | Code Available | 0 | 5 |
| End-to-End Video Question-Answer Generation with Generator-Pretester Network | Jan 5, 2021 | Answer Generation, Question-Answer-Generation | Code Available | 0 | 5 |
| FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos | Dec 22, 2024 | Language Modelling, Large Language Model | Code Available | 0 | 5 |
| MemexQA: Visual Memex Question Answering | Aug 4, 2017 | Memex Question Answering, Question Answering | Code Available | 0 | 5 |
| MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning | Jul 4, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework | Apr 9, 2021 | Language Modelling, Multiple-choice | Code Available | 0 | 5 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | Apr 11, 2024 | Descriptive, Hallucination | Code Available | 0 | 5 |
| VideoQA in the Era of LLMs: An Empirical Study | Aug 8, 2024 | Multimodal Large Language Model, Video Question Answering | Code Available | 0 | 5 |
| LongVILA: Scaling Long-Context Visual Language Models for Long Videos | Aug 19, 2024 | Video Captioning, Video Question Answering | Code Available | 0 | 5 |
| ActBERT: Learning Global-Local Video-Text Representations | Nov 14, 2020 | Action Segmentation, Question Answering | Code Available | 0 | 5 |
| Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models | May 16, 2025 | Image Captioning, Question Answering | Code Available | 0 | 5 |