| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition, Activity Recognition | —Unverified | 0 |
| HFGCN: Hypergraph Fusion Graph Convolutional Networks for Skeleton-Based Action Recognition | Jan 19, 2025 | Action Recognition, Relation Classification | —Unverified | 0 |
| Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions | May 28, 2024 | Action Recognition, Video Recognition | —Unverified | 0 |
| HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding | Jan 1, 2025 | Question Answering, Video Understanding | —Unverified | 0 |
| HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding | Dec 5, 2023 | Diversity, Graph Generation | —Unverified | 0 |
| Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis | May 28, 2021 | Multimodal Sentiment Analysis, Object Recognition | —Unverified | 0 |
| HLVU : A New Challenge to Test Deep Understanding of Movies the Way Humans do | May 1, 2020 | Video Understanding | —Unverified | 0 |
| H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving | Jan 8, 2025 | Autonomous Driving, Mamba | —Unverified | 0 |
| How Can Objects Help Video-Language Understanding? | Apr 10, 2025 | Image Captioning, Object | —Unverified | 0 |
| How to Make a BLT Sandwich? Learning to Reason towards Understanding Web Instructional Videos | Dec 2, 2018 | Logical Reasoning, Question Answering | —Unverified | 0 |
| How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos? | Apr 19, 2025 | Video Understanding | —Unverified | 0 |
| HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding | Jan 25, 2025 | Action Understanding, Emotion Recognition | —Unverified | 0 |
| HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data | Dec 23, 2024 | Action Recognition, Video Understanding | —Unverified | 0 |
| HuMoCon: Concept Discovery for Human Motion Understanding | Jan 1, 2025 | Video Understanding | —Unverified | 0 |
| i-Code: An Integrative and Composable Multimodal Learning Framework | May 3, 2022 | Contrastive Learning, Video Understanding | —Unverified | 0 |
| Identifying Auxiliary or Adversarial Tasks Using Necessary Condition Analysis for Adversarial Multi-task Video Understanding | Aug 22, 2022 | Action Recognition, Multi-Task Learning | —Unverified | 0 |
| Identity-aware Graph Memory Network for Action Detection | Aug 26, 2021 | Action Detection, Graph Neural Network | —Unverified | 0 |
| iMOVE: Instance-Motion-Aware Video Understanding | Feb 17, 2025 | Computational Efficiency, Video Understanding | —Unverified | 0 |
| Impossible Videos | Mar 18, 2025 | counterfactual, Video Generation | —Unverified | 0 |
| Improving LLM Video Understanding with 16 Frames Per Second | Mar 18, 2025 | MME, Video MME | —Unverified | 0 |
| Improving Video Model Transfer With Dynamic Representation Learning | Jan 1, 2022 | Action Classification, Knowledge Distillation | —Unverified | 0 |
| Inductive Attention for Video Action Anticipation | Dec 17, 2022 | Action Anticipation, Action Recognition | —Unverified | 0 |
| InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding | Jun 18, 2025 | GPU, Streaming video understanding | —Unverified | 0 |
| InstructionBench: An Instructional Video Understanding Benchmark | Apr 7, 2025 | Common Sense Reasoning, Multiple-choice | —Unverified | 0 |
| Instrument-tissue Interaction Detection Framework for Surgical Video Understanding | Mar 30, 2024 | Video Understanding | —Unverified | 0 |
| Integrated Object Detection and Tracking with Tracklet-Conditioned Detection | Nov 27, 2018 | Object, object-detection | —Unverified | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Jul 3, 2024 | Articles, Image Comprehension | —Unverified | 0 |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | Jan 21, 2025 | Instruction Following, Mathematical Reasoning | —Unverified | 0 |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | Jul 13, 2023 | Action Recognition, Contrastive Learning | —Unverified | 0 |
| InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | Jan 21, 2025 | Object Tracking, Referring Expression Segmentation | —Unverified | 0 |
| InternVQA: Advancing Compressed Video Quality Assessment with Distilling Large Foundation Model | Feb 26, 2025 | Video Quality Assessment, Video Understanding | —Unverified | 0 |
| Interpretable Action Recognition on Hard to Classify Actions | Sep 19, 2024 | Action Recognition, Depth Estimation | —Unverified | 0 |
| InterRVOS: Interaction-aware Referring Video Object Segmentation | Jun 3, 2025 | 8k, Object | —Unverified | 0 |
| In-the-Wild Video Question Answering | Oct 1, 2022 | Evidence Selection, Question Answering | —Unverified | 0 |
| Inverse Compositional Learning for Weakly-supervised Relation Grounding | Jan 1, 2023 | Relation, Video Understanding | —Unverified | 0 |
| IPAD: Industrial Process Anomaly Detection Dataset | Apr 23, 2024 | Anomaly Detection, Video Anomaly Detection | —Unverified | 0 |
| IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes | Jun 26, 2025 | Attribute, Question Answering | —Unverified | 0 |
| IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs | Dec 13, 2024 | Question Answering, Video Question Answering | —Unverified | 0 |
| Is Temporal Prompting All We Need For Limited Labeled Action Recognition? | Apr 2, 2025 | Action Recognition, All | —Unverified | 0 |
| Joint Engagement Classification using Video Augmentation Techniques for Multi-person Human-robot Interaction | Dec 28, 2022 | Data Augmentation, Face Swapping | —Unverified | 0 |
| Jointly Learning Energy Expenditures and Activities Using Egocentric Multimodal Signals | Jul 1, 2017 | Video Understanding | —Unverified | 0 |
| Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input | Aug 28, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| KeyVideoLLM: Towards Large-scale Video Keyframe Selection | Jul 3, 2024 | Data Compression, Management | —Unverified | 0 |
| Kill Two Birds With One Stone: Boosting Both Object Detection Accuracy and Speed With adaptive Patch-of-Interest Composition | Aug 12, 2017 | Object, object-detection | —Unverified | 0 |
| KnowIT VQA: Answering Knowledge-Based Questions about Videos | Oct 23, 2019 | Question Answering, Video Question Answering | —Unverified | 0 |
| Knowledge-Based Visual Question Answering in Videos | Apr 17, 2020 | Question Answering, Video Question Answering | —Unverified | 0 |
| Koala: Key frame-conditioned long video-LLM | Apr 5, 2024 | Action Recognition, Question Answering | —Unverified | 0 |
| Label Denoising with Large Ensembles of Heterogeneous Neural Networks | Sep 12, 2018 | Data Augmentation, Denoising | —Unverified | 0 |
| Language as the Medium: Multimodal Video Classification through text only | Sep 19, 2023 | Action Recognition, Video Classification | —Unverified | 0 |
| M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers | Apr 2, 2021 | Diagnostic, Video Editing | —Unverified | 0 |