| Inductive Attention for Video Action Anticipation | Dec 17, 2022 | Action Anticipation, Action Recognition | —Unverified | 0 |
| Discrete neural representations for explainable anomaly detection | Dec 10, 2021 | Anomaly Detection, Object | —Unverified | 0 |
| Improving Video Model Transfer With Dynamic Representation Learning | Jan 1, 2022 | Action Classification, Knowledge Distillation | —Unverified | 0 |
| Improving LLM Video Understanding with 16 Frames Per Second | Mar 18, 2025 | MME, Video MME | —Unverified | 0 |
| Discerning Generic Event Boundaries in Long-Form Wild Videos | Jun 18, 2021 | Boundary Detection, Form | —Unverified | 0 |
| Action Understanding with Multiple Classes of Actors | Apr 27, 2017 | Action Recognition, Action Segmentation | —Unverified | 0 |
| Impossible Videos | Mar 18, 2025 | counterfactual, Video Generation | —Unverified | 0 |
| Learning to Focus on the Foreground for Temporal Sentence Grounding | Oct 1, 2022 | Sentence, Temporal Sentence Grounding | —Unverified | 0 |
| iMOVE: Instance-Motion-Aware Video Understanding | Feb 17, 2025 | Computational Efficiency, Video Understanding | —Unverified | 0 |
| Identity-aware Graph Memory Network for Action Detection | Aug 26, 2021 | Action Detection, Graph Neural Network | —Unverified | 0 |
| Identifying Auxiliary or Adversarial Tasks Using Necessary Condition Analysis for Adversarial Multi-task Video Understanding | Aug 22, 2022 | Action Recognition, Multi-Task Learning | —Unverified | 0 |
| AirLetters: An Open Video Dataset of Characters Drawn in the Air | Oct 3, 2024 | Video Understanding | —Unverified | 0 |
| Action Sensitivity Learning for Temporal Action Localization | May 25, 2023 | Action Localization, Moment Queries | —Unverified | 0 |
| i-Code: An Integrative and Composable Multimodal Learning Framework | May 3, 2022 | Contrastive Learning, Video Understanding | —Unverified | 0 |
| Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding | Mar 24, 2024 | Dense Video Captioning, Temporal Localization | —Unverified | 0 |
| Development of a MultiModal Annotation Framework and Dataset for Deep Video Understanding | Jun 1, 2022 | Knowledge Graphs, Video Understanding | —Unverified | 0 |
| HuMoCon: Concept Discovery for Human Motion Understanding | Jan 1, 2025 | Video Understanding | —Unverified | 0 |
| HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data | Dec 23, 2024 | Action Recognition, Video Understanding | —Unverified | 0 |
| Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection | Jul 29, 2020 | object-detection, Object Detection | —Unverified | 0 |
| HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding | Jan 25, 2025 | Action Understanding, Emotion Recognition | —Unverified | 0 |
| FE-Adapter: Adapting Image-based Emotion Classifiers to Videos | Aug 5, 2024 | Dynamic Facial Expression Recognition, Emotion Recognition | —Unverified | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | —Unverified | 0 |
| DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding | May 19, 2018 | Action Recognition In Videos, Gesture Recognition | —Unverified | 0 |
| AVD2: Accident Video Diffusion for Accident Video Description | Feb 20, 2025 | Autonomous Driving, Scene Understanding | —Unverified | 0 |
| How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos? | Apr 19, 2025 | Video Understanding | —Unverified | 0 |
| How to Make a BLT Sandwich? Learning to Reason towards Understanding Web Instructional Videos | Dec 2, 2018 | Logical Reasoning, Question Answering | —Unverified | 0 |
| Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network | Jun 2, 2019 | General Classification, Graph Neural Network | —Unverified | 0 |
| MM-Ego: Towards Building Egocentric Multimodal LLMs | Oct 9, 2024 | Video Understanding | —Unverified | 0 |
| How Can Objects Help Video-Language Understanding? | Apr 10, 2025 | Image Captioning, Object | —Unverified | 0 |
| H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving | Jan 8, 2025 | Autonomous Driving, Mamba | —Unverified | 0 |
| HLVU : A New Challenge to Test Deep Understanding of Movies the Way Humans do | May 1, 2020 | Video Understanding | —Unverified | 0 |
| Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis | May 28, 2021 | Multimodal Sentiment Analysis, Object Recognition | —Unverified | 0 |
| Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding | May 23, 2025 | Form, Question Answering | —Unverified | 0 |
| Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural Architecture Search | Dec 9, 2021 | Neural Architecture Search, Video Recognition | —Unverified | 0 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | Sep 30, 2024 | Mixture-of-Experts, Optical Character Recognition (OCR) | —Unverified | 0 |
| HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding | Dec 5, 2023 | Diversity, Graph Generation | —Unverified | 0 |
| Deep Spatio-Temporal Random Fields for Efficient Video Segmentation | Jul 3, 2018 | Instance Segmentation, Semantic Segmentation | —Unverified | 0 |
| HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding | Jan 1, 2025 | Question Answering, Video Understanding | —Unverified | 0 |
| Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training | Jul 5, 2020 | Decoder, Question Answering | —Unverified | 0 |
| AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | Oct 4, 2024 | Image Captioning, Video Understanding | —Unverified | 0 |
| Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions | May 28, 2024 | Action Recognition, Video Recognition | —Unverified | 0 |
| HFGCN:Hypergraph Fusion Graph Convolutional Networks for Skeleton-Based Action Recognition | Jan 19, 2025 | Action Recognition, Relation Classification | —Unverified | 0 |
| Deep learning for action spotting in association football videos | Oct 2, 2024 | Action Spotting, Benchmarking | —Unverified | 0 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition, Activity Recognition | —Unverified | 0 |
| DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description | Mar 31, 2025 | Video Description, Video Understanding | —Unverified | 0 |
| Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions | Mar 11, 2024 | counterfactual, Video Editing | —Unverified | 0 |
| Cycle-Contrast for Self-Supervised Video Representation Learning | Oct 28, 2020 | Action Recognition, Contrastive Learning | —Unverified | 0 |
| A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset | Nov 19, 2022 | Common Sense Reasoning, Graph Embedding | —Unverified | 0 |
| HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions | Sep 16, 2024 | Dimensionality Reduction, Video Understanding | —Unverified | 0 |
| Aggregating Frame-level Features for Large-Scale Video Classification | Jul 4, 2017 | Classification, General Classification | —Unverified | 0 |