| Occluded Video Instance Segmentation: A Benchmark | Feb 2, 2021 | Instance Segmentation, Segmentation | Code Available | 1 | 5 |
| BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding | Mar 27, 2025 | Form, Language Modeling | Code Available | 1 | 5 |
| No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding | May 14, 2024 | Action Detection, GPU | Code Available | 1 | 5 |
| Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention | Jun 11, 2021 | Action Recognition, Sign Language Recognition | Code Available | 1 | 5 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | May 18, 2021 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Apr 21, 2025 | Video Understanding | Code Available | 1 | 5 |
| A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions | Apr 21, 2022 | Action Detection, Video Understanding | Code Available | 1 | 5 |
| Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation | Oct 31, 2024 | Action Segmentation, Action Understanding | Code Available | 1 | 5 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Jun 19, 2021 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization | Jun 14, 2020 | Action Detection, Action Localization | Code Available | 1 | 5 |
| A Multigrid Method for Efficiently Training Video Models | Dec 2, 2019 | Action Detection, Action Recognition | Code Available | 1 | 5 |
| InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges | Nov 17, 2022 | Future Hand Prediction, Moment Queries | Code Available | 1 | 5 |
| Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Jan 1, 2025 | audio-visual learning, Knowledge Graphs | Code Available | 1 | 5 |
| Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Oct 13, 2021 | Benchmarking, Computational Efficiency | Code Available | 1 | 5 |
| MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | May 27, 2025 | Reinforcement Learning (RL), Video Understanding | Code Available | 1 | 5 |
| IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Sep 30, 2021 | Video Summarization, Video Understanding | Code Available | 1 | 5 |
| Is Appearance Free Action Recognition Possible? | Jul 13, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 1 | 5 |
| BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection | May 5, 2022 | Action Detection, object-detection | Code Available | 1 | 5 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choice, Video Understanding | Code Available | 1 | 5 |
| ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation | Jan 14, 2025 | Mamba, Video Understanding | Code Available | 1 | 5 |
| AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Jun 19, 2024 | Question Answering, Spatial Reasoning | Code Available | 1 | 5 |
| Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Jul 11, 2024 | EEG, Language Modeling | Code Available | 1 | 5 |
| Multimodal Distillation for Egocentric Action Recognition | Jul 14, 2023 | Action Recognition, Knowledge Distillation | Code Available | 1 | 5 |
| Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding | Nov 25, 2023 | Video Understanding | Code Available | 1 | 5 |
| AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions | May 23, 2017 | Action Detection | Code Available | 1 | 5 |
| Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Mar 20, 2025 | Multiple-choice, Video Understanding | Code Available | 1 | 5 |
| Multimodal Long Video Modeling Based on Temporal Dynamic Context | Apr 14, 2025 | Video Understanding | Code Available | 1 | 5 |
| AutoVideo: An Automated Video Action Recognition System | Aug 9, 2021 | Action Recognition, AutoML | Code Available | 1 | 5 |
| How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Mar 27, 2022 | Self-Supervised Learning, Sensitivity | Code Available | 1 | 5 |
| Action Scene Graphs for Long-Form Understanding of Egocentric Videos | Dec 6, 2023 | Action Anticipation, Form | Code Available | 1 | 5 |
| Large Scale Holistic Video Understanding | Apr 25, 2019 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Dec 12, 2023 | Anomaly Detection, Autonomous Driving | Code Available | 1 | 5 |
| MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing | Nov 28, 2022 | Activity Recognition, Few Shot Action Recognition | Code Available | 1 | 5 |
| Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Feb 4, 2025 | Video Understanding | Code Available | 1 | 5 |
| Learning Video Context as Interleaved Multimodal Sequences | Jul 31, 2024 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | Mar 2, 2025 | Large Language Model, Multi-Instance Retrieval | Code Available | 1 | 5 |
| HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Aug 12, 2024 | Action Localization, Temporal Action Localization | Code Available | 1 | 5 |
| Language Repository for Long Video Understanding | Mar 21, 2024 | EgoSchema, Question Answering | Code Available | 1 | 5 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchema, Question Answering | Code Available | 1 | 5 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model, Video Understanding | Code Available | 1 | 5 |
| Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning | Jan 1, 2023 | Contrastive Learning, Representation Learning | Code Available | 1 | 5 |
| MotionSqueeze: Neural Motion Feature Learning for Video Understanding | Jul 20, 2020 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions | May 16, 2021 | Action Detection, Action Localization | Code Available | 1 | 5 |
| Crossover Learning for Fast Online Video Instance Segmentation | Apr 13, 2021 | Instance Segmentation, Semantic Segmentation | Code Available | 1 | 5 |
| From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities | Jan 10, 2025 | Human-Object Interaction Detection, Knowledge Distillation | Code Available | 1 | 5 |
| CyberV: Cybernetics for Test-time Scaling in Video Understanding | Jun 9, 2025 | Video Understanding | Code Available | 1 | 5 |
| Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Aug 15, 2023 | Decoder, Object | Code Available | 1 | 5 |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | May 30, 2022 | counterfactual, Descriptive | Code Available | 1 | 5 |
| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Sep 27, 2024 | Video Understanding, Visual Reasoning | Code Available | 1 | 5 |