| M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation | Jun 15, 2025 | ObjectSemantic Segmentation | CodeCode Available | 1 |
| Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation | Jun 13, 2025 | Anomaly DetectionClustering | CodeCode Available | 1 |
| CyberV: Cybernetics for Test-time Scaling in Video Understanding | Jun 9, 2025 | Video Understanding | CodeCode Available | 1 |
| SiLVR: A Simple Language-based Video Reasoning Framework | May 30, 2025 | MathMME | CodeCode Available | 1 |
| DisTime: Distribution-based Time Representation for Video Large Language Models | May 30, 2025 | Temporal LocalizationVideo Understanding | CodeCode Available | 1 |
| VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software | May 30, 2025 | Question AnsweringSpatial Reasoning | CodeCode Available | 1 |
| PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling | May 29, 2025 | Video Understanding | CodeCode Available | 1 |
| VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? | May 29, 2025 | Video Understanding | CodeCode Available | 1 |
| VidText: Towards Comprehensive Evaluation for Video Text Understanding | May 28, 2025 | Multimodal ReasoningOptical Character Recognition (OCR) | CodeCode Available | 1 |
| MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | May 27, 2025 | Reinforcement Learning (RL)Video Understanding | CodeCode Available | 1 |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | May 22, 2025 | Misinformationreinforcement-learning | CodeCode Available | 1 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption GenerationRetrieval | CodeCode Available | 1 |
| Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection | May 5, 2025 | Anomaly DetectionAnomaly Detection In Surveillance Videos | CodeCode Available | 1 |
| TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | May 2, 2025 | Dense CaptioningHighlight Detection | CodeCode Available | 1 |
| VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding | May 2, 2025 | Anomaly DetectionCommon Sense Reasoning | CodeCode Available | 1 |
| VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Apr 25, 2025 | Caption GenerationEgoSchema | CodeCode Available | 1 |
| IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Apr 21, 2025 | Video Understanding | CodeCode Available | 1 |
| VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models | Apr 17, 2025 | HallucinationVideo Understanding | CodeCode Available | 1 |
| Multimodal Long Video Modeling Based on Temporal Dynamic Context | Apr 14, 2025 | Video Understanding | CodeCode Available | 1 |
| F^3Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos | Apr 11, 2025 | Action UnderstandingEvent Detection | CodeCode Available | 1 |
| Slow-Fast Architecture for Video Multi-Modal Large Language Models | Apr 2, 2025 | Video Understanding | CodeCode Available | 1 |
| BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding | Mar 27, 2025 | FormLanguage Modeling | CodeCode Available | 1 |
| Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | Mar 25, 2025 | HallucinationHallucination Evaluation | CodeCode Available | 1 |
| PAVE: Patching and Adapting Video Large Language Models | Mar 25, 2025 | Audio-visual Question AnsweringMulti-Task Learning | CodeCode Available | 1 |
| MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps | Mar 23, 2025 | Scene SegmentationVideo Understanding | CodeCode Available | 1 |
| V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction | Mar 22, 2025 | BenchmarkingVideo Understanding | CodeCode Available | 1 |
| STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding | Mar 20, 2025 | Video UnderstandingZero-shot Generalization | CodeCode Available | 1 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchemaQuestion Answering | CodeCode Available | 1 |
| Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Mar 20, 2025 | Multiple-choiceVideo Understanding | CodeCode Available | 1 |
| Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma? | Mar 16, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos | Mar 9, 2025 | Action LocalizationBoundary Detection | CodeCode Available | 1 |
| Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | Mar 2, 2025 | Large Language ModelMulti-Instance Retrieval | CodeCode Available | 1 |
| Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos | Feb 25, 2025 | Graph LearningMistake Detection | CodeCode Available | 1 |
| VRoPE: Rotary Position Embedding for Video Large Language Models | Feb 17, 2025 | PositionVideo Understanding | CodeCode Available | 1 |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | Feb 17, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Feb 4, 2025 | Video Understanding | CodeCode Available | 1 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Feb 4, 2025 | Autonomous DrivingMultiple-choice | CodeCode Available | 1 |
| -Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question AnsweringVideo Question Answering | CodeCode Available | 1 |
| Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness | Jan 14, 2025 | Event ExtractionInstruction Following | CodeCode Available | 1 |
| AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation | Jan 14, 2025 | MambaVideo Understanding | CodeCode Available | 1 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Jan 13, 2025 | Causal DiscoveryCausal Inference | CodeCode Available | 1 |
| VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning | Jan 12, 2025 | Dense Video CaptioningVideo Captioning | CodeCode Available | 1 |
| From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities | Jan 10, 2025 | Human-Object Interaction DetectionKnowledge Distillation | CodeCode Available | 1 |
| Unifying Specialized Visual Encoders for Video Language Models | Jan 2, 2025 | Multiple-choiceVideo Understanding | CodeCode Available | 1 |
| Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Jan 1, 2025 | audio-visual learningKnowledge Graphs | CodeCode Available | 1 |
| Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Jan 1, 2025 | Action RecognitionAction Segmentation | CodeCode Available | 1 |
| ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding | Dec 29, 2024 | Video CompressionVideo Understanding | CodeCode Available | 1 |
| Do Language Models Understand Time? | Dec 18, 2024 | Action RecognitionAnomaly Detection | CodeCode Available | 1 |
| FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding | Dec 18, 2024 | Highlight DetectionMoment Retrieval | CodeCode Available | 1 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language ModelVideo Understanding | CodeCode Available | 1 |