| Frame-Voyager: Learning to Query Frames for Video Large Language Models | Oct 4, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis | Oct 25, 2024 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | Sep 20, 2024 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| VITED: Video Temporal Evidence Distillation | Mar 17, 2025 | Question Answering, Video Question Answering | —Unverified | 0 | 0 |
| A Review of Deep Learning for Video Captioning | Apr 22, 2023 | Deep Learning, Dense Video Captioning | —Unverified | 0 | 0 |
| Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | Nov 16, 2021 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering | Jan 3, 2024 | Question Answering, Scheduling | —Unverified | 0 | 0 |
| GPT-4o System Card | Oct 25, 2024 | Multiple-choice, Spatial Reasoning | —Unverified | 0 | 0 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | Nov 25, 2023 | Instruction Following, Language Modeling | —Unverified | 0 | 0 |
| Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering | May 30, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding | Sep 29, 2024 | Diversity, Question Answering | —Unverified | 0 | 0 |
| HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering | Jan 1, 2021 | Question Answering, Relational Reasoning | —Unverified | 0 | 0 |
| Harvest Video Foundation Models via Efficient Post-Pretraining | Oct 30, 2023 | Question Answering, Text Retrieval | —Unverified | 0 | 0 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | May 22, 2023 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| Video Dialog via Progressive Inference and Cross-Transformer | Nov 1, 2019 | Answer Generation, Question Answering | —Unverified | 0 | 0 |
| VideoDistill: Language-aware Vision Distillation for Video Question Answering | Apr 1, 2024 | Answer Generation, Question Answering | —Unverified | 0 | 0 |
| Hierarchical Banzhaf Interaction for General Video-Language Representation Learning | Dec 30, 2024 | Contrastive Learning, Question Answering | —Unverified | 0 | 0 |
| Hierarchical Conditional Relation Networks for Multimodal Video Question Answering | Oct 18, 2020 | Question Answering, Relation | —Unverified | 0 | 0 |
| Hierarchical Memory for Long Video QA | Jun 30, 2024 | GPU, Question Answering | —Unverified | 0 | 0 |
| Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering | Jun 25, 2021 | Object, Question Answering | —Unverified | 0 | 0 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Dec 30, 2022 | cross-modal alignment, TGIF-Action | —Unverified | 0 | 0 |
| Holistic Multi-modal Memory Network for Movie Question Answering | Nov 12, 2018 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| How Can Objects Help Video-Language Understanding? | Apr 10, 2025 | Image Captioning, Object | —Unverified | 0 | 0 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | Jun 19, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation | Nov 27, 2024 | Graph Generation, Question Answering | —Unverified | 0 | 0 |