| FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis | Oct 25, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | Sep 20, 2024 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| VITED: Video Temporal Evidence Distillation | Mar 17, 2025 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| A Review of Deep Learning for Video Captioning | Apr 22, 2023 | Deep Learning, Dense Video Captioning | —Unverified | 0 | 0 |
| Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | Nov 16, 2021 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering | Jan 3, 2024 | Question Answering, Scheduling | —Unverified | 0 | 0 |
| GPT-4o System Card | Oct 25, 2024 | Multiple-choice, Spatial Reasoning | —Unverified | 0 | 0 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | Nov 25, 2023 | Instruction Following, Language Modeling | —Unverified | 0 | 0 |
| Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering | May 30, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding | Sep 29, 2024 | Diversity, Question Answering | —Unverified | 0 | 0 |
| HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering | Jan 1, 2021 | Question Answering, Relational Reasoning | —Unverified | 0 | 0 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | May 22, 2023 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| Video Dialog via Progressive Inference and Cross-Transformer | Nov 1, 2019 | Answer Generation, Question Answering | —Unverified | 0 | 0 |
| VideoDistill: Language-aware Vision Distillation for Video Question Answering | Apr 1, 2024 | Answer Generation, Question Answering | —Unverified | 0 | 0 |
| Hierarchical Conditional Relation Networks for Multimodal Video Question Answering | Oct 18, 2020 | Question Answering, Relation | —Unverified | 0 | 0 |
| Hierarchical Memory for Long Video QA | Jun 30, 2024 | GPU, Question Answering | —Unverified | 0 | 0 |
| Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering | Jun 25, 2021 | Object, Question Answering | —Unverified | 0 | 0 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Dec 30, 2022 | cross-modal alignment, TGIF-Action | —Unverified | 0 | 0 |
| Holistic Multi-modal Memory Network for Movie Question Answering | Nov 12, 2018 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| How Can Objects Help Video-Language Understanding? | Apr 10, 2025 | Image Captioning, Object | —Unverified | 0 | 0 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | Jun 19, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation | Nov 27, 2024 | Graph Generation, Question Answering | —Unverified | 0 | 0 |
| HySTER: A Hybrid Spatio-Temporal Event Reasoner | Jan 17, 2021 | Inductive logic programming, Question Answering | —Unverified | 0 | 0 |
| In-the-Wild Video Question Answering | Oct 1, 2022 | Evidence Selection, Question Answering | —Unverified | 0 | 0 |
| Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering | Jul 3, 2024 | Contrastive Learning, Language Modelling | —Unverified | 0 | 0 |
| iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | Nov 16, 2020 | Common Sense Reasoning, Dense Video Captioning | —Unverified | 0 | 0 |
| IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs | Dec 13, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability | Jun 25, 2021 | Bias Detection, Question Answering | —Unverified | 0 | 0 |
| Is a Video worth n × n Images? A Highly Efficient Approach to Transformer-based Video Question Answering | May 16, 2023 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Zero-Shot Video Question Answering with Procedural Programs | Dec 1, 2023 | Code Generation, Language Modeling | —Unverified | 0 | 0 |
| KeyVideoLLM: Towards Large-scale Video Keyframe Selection | Jul 3, 2024 | Data Compression, Management | —Unverified | 0 | 0 |
| Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering | Jul 25, 2023 | graph construction, Question Answering | —Unverified | 0 | 0 |
| KnowIT VQA: Answering Knowledge-Based Questions about Videos | Oct 23, 2019 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Video Instruction Tuning With Synthetic Data | Oct 3, 2024 | 3D Question Answering (3D-QA) | —Unverified | 0 | 0 |
| Knowledge-Based Visual Question Answering in Videos | Apr 17, 2020 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Knowledge Proxy Intervention for Deconfounded Video Question Answering | Jan 1, 2023 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Koala: Key frame-conditioned long video-LLM | Apr 5, 2024 | Action Recognition, Question Answering | —Unverified | 0 | 0 |
| Language-aware Visual Semantic Distillation for Video Question Answering | Jan 1, 2024 | Answer Generation, Question Answering | —Unverified | 0 | 0 |
| Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering | Apr 7, 2023 | Question Answering, Question Generation | —Unverified | 0 | 0 |
| (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering | Feb 18, 2022 | Question Answering, Spatio-temporal Scene Graphs | —Unverified | 0 | 0 |
| Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA | May 1, 2022 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning | Mar 30, 2021 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering | Dec 1, 2021 | Multimodal Reasoning, Question Answering | —Unverified | 0 | 0 |
| Learning Question-Guided Video Representation for Multi-Turn Video Question Answering | Jul 31, 2019 | Navigate, Question Answering | —Unverified | 0 | 0 |
| Adversarial Multimodal Network for Movie Question Answering | Jun 24, 2019 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| Advancing Egocentric Video Question Answering with Multimodal Large Language Models | Apr 6, 2025 | Object Recognition, Question Answering | —Unverified | 0 | 0 |
| Neural Reasoning, Fast and Slow, for Video Question Answering | Jul 10, 2019 | Natural Questions, Question Answering | —Unverified | 0 | 0 |
| Learning to Rehearse in Long Sequence Memorization | Jun 2, 2021 | Memorization, Question Answering | —Unverified | 0 | 0 |
| Learning Trajectory-Word Alignments for Video-Language Tasks | Jan 5, 2023 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| Leveraging Static Relationships for Intra-Type and Inter-Type Message Passing in Video Question Answering | Apr 3, 2025 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |