| EPIC Fields: Marrying 3D Geometry and Video Understanding | Jun 14, 2023 | 3D geometryNeural Rendering | CodeCode Available | 1 | 5 |
| Technical Report: Temporal Aggregate Representations | Jun 6, 2021 | Action AnticipationAction Recognition | CodeCode Available | 1 | 5 |
| Long Movie Clip Classification with State-Space Video Models | Apr 4, 2022 | ClassificationDecoder | CodeCode Available | 1 | 5 |
| T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs | Nov 29, 2024 | Data AugmentationDiversity | CodeCode Available | 1 | 5 |
| CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction | Aug 29, 2023 | Federated Learningimage-classification | CodeCode Available | 1 | 5 |
| Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | Mar 25, 2025 | HallucinationHallucination Evaluation | CodeCode Available | 1 | 5 |
| Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection | Apr 14, 2024 | Highlight DetectionMoment Retrieval | CodeCode Available | 1 | 5 |
| Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation | Mar 18, 2024 | Referring Video Object SegmentationSemantic Segmentation | CodeCode Available | 1 | 5 |
| Learning Temporally Latent Causal Processes from General Temporal Data | Sep 29, 2021 | Causal DiscoveryDisentanglement | CodeCode Available | 1 | 5 |
| Learning the Predictability of the Future | Jun 19, 2021 | Representation LearningSelf-Supervised Action Recognition | CodeCode Available | 1 | 5 |
| Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention | Jun 11, 2021 | Action RecognitionSign Language Recognition | CodeCode Available | 1 | 5 |
| Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Apr 12, 2024 | Dense Video CaptioningTransfer Learning | CodeCode Available | 1 | 5 |
| Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization | Aug 4, 2021 | Contrastive LearningRepresentation Learning | CodeCode Available | 1 | 5 |
| IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Apr 21, 2025 | Video Understanding | CodeCode Available | 1 | 5 |
| F^3Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos | Apr 11, 2025 | Action UnderstandingEvent Detection | CodeCode Available | 1 | 5 |
| Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness | Jan 14, 2025 | Event ExtractionInstruction Following | CodeCode Available | 1 | 5 |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | May 22, 2025 | Misinformationreinforcement-learning | CodeCode Available | 1 | 5 |
| Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Mar 24, 2021 | Action LocalizationTemporal Action Localization | CodeCode Available | 1 | 5 |
| Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos | Feb 25, 2025 | Graph LearningMistake Detection | CodeCode Available | 1 | 5 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Oct 14, 2024 | 2kBenchmarking | CodeCode Available | 1 | 5 |
| Test of Time: Instilling Video-Language Models with a Sense of Time | Jan 5, 2023 | Video-Text RetrievalVideo Understanding | CodeCode Available | 1 | 5 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language ModelVideo Understanding | CodeCode Available | 1 | 5 |
| IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Sep 30, 2021 | Video SummarizationVideo Understanding | CodeCode Available | 1 | 5 |
| -Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question AnsweringVideo Question Answering | CodeCode Available | 1 | 5 |
| End-to-End Video Instance Segmentation with Transformers | Nov 30, 2020 | Instance SegmentationSegmentation | CodeCode Available | 1 | 5 |
| Streaming Video Temporal Action Segmentation In Real Time | Sep 28, 2022 | Action SegmentationLanguage Modelling | CodeCode Available | 1 | 5 |
| Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Jul 11, 2024 | EEGLanguage Modeling | CodeCode Available | 1 | 5 |
| End-to-end Temporal Action Detection with Transformer | Jun 18, 2021 | Action DetectionTemporal Action Localization | CodeCode Available | 1 | 5 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choiceVideo Understanding | CodeCode Available | 1 | 5 |
| End-to-End Streaming Video Temporal Action Segmentation with Reinforce Learning | Sep 27, 2023 | Action RecognitionAction Segmentation | CodeCode Available | 1 | 5 |
| FineAction: A Fine-Grained Video Dataset for Temporal Action Localization | May 24, 2021 | Action DetectionAction Localization | CodeCode Available | 1 | 5 |
| Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Jan 1, 2025 | Action RecognitionAction Segmentation | CodeCode Available | 1 | 5 |
| MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Jul 8, 2025 | Autonomous DrivingVideo Understanding | CodeCode Available | 1 | 5 |
| Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Aug 18, 2023 | point cloud video understandingSelf-Supervised Learning | CodeCode Available | 1 | 5 |
| End-to-End Referring Video Object Segmentation with Multimodal Transformers | Nov 29, 2021 | Inductive BiasInstance Segmentation | CodeCode Available | 1 | 5 |
| Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Mar 20, 2025 | Multiple-choiceVideo Understanding | CodeCode Available | 1 | 5 |
| InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges | Nov 17, 2022 | Future Hand PredictionMoment Queries | CodeCode Available | 1 | 5 |
| How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Dec 12, 2023 | Anomaly DetectionAutonomous Driving | CodeCode Available | 1 | 5 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Oct 24, 2024 | document understandingVideo Understanding | CodeCode Available | 1 | 5 |
| STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding | Mar 20, 2025 | Video UnderstandingZero-shot Generalization | CodeCode Available | 1 | 5 |
| MMAD: Multi-label Micro-Action Detection in Videos | Jul 7, 2024 | Action AnalysisAction Detection | CodeCode Available | 1 | 5 |
| How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Mar 27, 2022 | Self-Supervised LearningSensitivity | CodeCode Available | 1 | 5 |
| Compositional Video Understanding with Spatiotemporal Structure-based Transformers | Jan 1, 2024 | Video Understanding | CodeCode Available | 1 | 5 |
| An overview on the evaluated video retrieval tasks at TRECVID 2022 | Jun 22, 2023 | Ad-hoc video searchRetrieval | CodeCode Available | 1 | 5 |
| Stochastic Image-to-Video Synthesis using cINNs | May 10, 2021 | DiversityVideo Understanding | CodeCode Available | 1 | 5 |
| Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Feb 4, 2025 | Video Understanding | CodeCode Available | 1 | 5 |
| Large Scale Holistic Video Understanding | Apr 25, 2019 | Action ClassificationAction Recognition | CodeCode Available | 1 | 5 |
| Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | Mar 2, 2025 | Large Language ModelMulti-Instance Retrieval | CodeCode Available | 1 | 5 |
| FrameExit: Conditional Early Exiting for Efficient Video Recognition | Apr 27, 2021 | Video RecognitionVideo Understanding | CodeCode Available | 1 | 5 |
| A Comprehensive Study of Deep Video Action Recognition | Dec 11, 2020 | Action RecognitionDeep Learning | CodeCode Available | 1 | 5 |