| VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding | Jul 17, 2025 | Video Grounding, Video Understanding | Unverified | 0 |
| UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks | Jul 15, 2025 | Video Captioning, Video Understanding | Code Available | 1 |
| Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI | Jul 14, 2025 | Large Language Model, Multimodal Large Language Model | Unverified | 0 |
| EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | Jul 14, 2025 | Scene Understanding, Spatial Reasoning | Unverified | 0 |
| Omni-Video: Democratizing Unified Video Understanding and Generation | Jul 8, 2025 | Video Generation, Video Understanding | Code Available | 2 |
| MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Jul 8, 2025 | Autonomous Driving, Video Understanding | Code Available | 1 |
| Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models | Jul 8, 2025 | Future prediction, Large Language Model | Unverified | 0 |
| Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation | Jul 8, 2025 | Depth Estimation, Depth Prediction | Unverified | 0 |
| Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges | Jul 2, 2025 | Video Understanding | Unverified | 0 |
| Kwai Keye-VL Technical Report | Jul 2, 2025 | Instruction Following, Reinforcement Learning (RL) | Code Available | 4 |
| GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Jul 1, 2025 | document understanding, Multimodal Reasoning | Code Available | 7 |
| CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs | Jul 1, 2025 | Text Generation, Video Understanding | Unverified | 0 |
| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Jun 30, 2025 | cross-modal alignment, EgoSchema | Code Available | 3 |
| ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment | Jun 28, 2025 | Dynamic Time Warping, Large Language Model | Code Available | 0 |
| Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs | Jun 27, 2025 | MME, Video MME | Unverified | 0 |
| LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs | Jun 27, 2025 | Question Answering, Video Question Answering | Code Available | 2 |
| IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes | Jun 26, 2025 | Attribute, Question Answering | Unverified | 0 |
| Task-Aware KV Compression For Cost-Effective Long Video Understanding | Jun 26, 2025 | Video Understanding | Code Available | 0 |
| PEVLM: Parallel Encoding for Vision-Language Models | Jun 24, 2025 | Autonomous Driving, Video Understanding | Unverified | 0 |
| GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning | Jun 19, 2025 | Multimodal Reasoning, reinforcement-learning | Unverified | 0 |
| video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Jun 18, 2025 | Audio captioning, Large Language Model | Code Available | 2 |
| InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding | Jun 18, 2025 | GPU, Streaming video understanding | Unverified | 0 |
| EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization | Jun 17, 2025 | Multi-Instance Retrieval, Retrieval | Code Available | 0 |
| MambaMia: A State-Space-Model-Based Compression for Efficient Video Understanding in Large Multimodal Models | Jun 16, 2025 | Video Understanding | Unverified | 0 |
| AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding | Jun 16, 2025 | Optical Character Recognition (OCR), RAG | Code Available | 0 |
| M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation | Jun 15, 2025 | Object, Semantic Segmentation | Code Available | 1 |
| Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation | Jun 13, 2025 | Anomaly Detection, Clustering | Code Available | 1 |
| VideoDeepResearch: Long Video Understanding With Agentic Tool Using | Jun 12, 2025 | MME, Video MME | Code Available | 2 |
| HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios | Jun 11, 2025 | Action Recognition, Action Segmentation | Code Available | 0 |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | Jun 10, 2025 | Multiple-choice, Open-Ended Question Answering | Unverified | 0 |
| MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding | Jun 10, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding | Jun 9, 2025 | RAG, Retrieval | Unverified | 0 |
| CyberV: Cybernetics for Test-time Scaling in Video Understanding | Jun 9, 2025 | Video Understanding | Code Available | 1 |
| Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding | Jun 9, 2025 | Contrastive Learning, Video Editing | Unverified | 0 |
| SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis | Jun 9, 2025 | Action Classification, Benchmarking | Unverified | 0 |
| Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models | Jun 6, 2025 | Segmentation, Video Understanding | Unverified | 0 |
| Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision | Jun 6, 2025 | Video Understanding | Code Available | 0 |
| TextVidBench: A Benchmark for Long Video Scene Text Understanding | Jun 5, 2025 | Prompt Engineering, Question Answering | Unverified | 0 |
| APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval | Jun 5, 2025 | Information Retrieval, Retrieval | Unverified | 0 |
| DualX-VSR: Dual Axial SpatialTemporal Transformer for Real-World Video Super-Resolution without Motion Compensation | Jun 5, 2025 | Motion Compensation, Optical Flow Estimation | Unverified | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | Jun 5, 2025 | Benchmarking, Video Understanding | Unverified | 0 |
| DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding | Jun 4, 2025 | MME, Video MME | Unverified | 0 |
| METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding | Jun 3, 2025 | Video Understanding | Code Available | 0 |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | Jun 3, 2025 | EgoSchema, Question Answering | Code Available | 0 |
| InterRVOS: Interaction-aware Referring Video Object Segmentation | Jun 3, 2025 | 8k, Object | Unverified | 0 |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | Jun 2, 2025 | Action Recognition, Video Understanding | Unverified | 0 |
| Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | Jun 2, 2025 | reinforcement-learning, Reinforcement Learning | Code Available | 2 |
| FlexSelect: Flexible Token Selection for Efficient Long Video Understanding | Jun 1, 2025 | Video Understanding | Unverified | 0 |
| Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis | May 31, 2025 | Scene Segmentation, Segmentation | Unverified | 0 |
| SiLVR: A Simple Language-based Video Reasoning Framework | May 30, 2025 | Math, MME | Code Available | 1 |