| VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding | Jul 17, 2025 | Video Grounding, Video Understanding | —Unverified | 0 |
| Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI | Jul 14, 2025 | Large Language Model, Multimodal Large Language Model | —Unverified | 0 |
| EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | Jul 14, 2025 | Scene Understanding, Spatial Reasoning | —Unverified | 0 |
| Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation | Jul 8, 2025 | Depth Estimation, Depth Prediction | —Unverified | 0 |
| Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models | Jul 8, 2025 | Future prediction, Large Language Model | —Unverified | 0 |
| Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges | Jul 2, 2025 | Video Understanding | —Unverified | 0 |
| CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs | Jul 1, 2025 | Text Generation, Video Understanding | —Unverified | 0 |
| ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment | Jun 28, 2025 | Dynamic Time Warping, Large Language Model | Code Available | 0 |
| Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs | Jun 27, 2025 | MME, Video MME | —Unverified | 0 |
| IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes | Jun 26, 2025 | Attribute, Question Answering | —Unverified | 0 |
| Task-Aware KV Compression For Cost-Effective Long Video Understanding | Jun 26, 2025 | Video Understanding | Code Available | 0 |
| PEVLM: Parallel Encoding for Vision-Language Models | Jun 24, 2025 | Autonomous Driving, Video Understanding | —Unverified | 0 |
| GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning | Jun 19, 2025 | Multimodal Reasoning, reinforcement-learning | —Unverified | 0 |
| InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding | Jun 18, 2025 | GPU, Streaming video understanding | —Unverified | 0 |
| EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization | Jun 17, 2025 | Multi-Instance Retrieval, Retrieval | Code Available | 0 |
| MambaMia: A State-Space-Model-Based Compression for Efficient Video Understanding in Large Multimodal Models | Jun 16, 2025 | Video Understanding | —Unverified | 0 |
| AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding | Jun 16, 2025 | Optical Character Recognition (OCR), RAG | Code Available | 0 |
| HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios | Jun 11, 2025 | Action Recognition, Action Segmentation | Code Available | 0 |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | Jun 10, 2025 | Multiple-choice, Open-Ended Question Answering | —Unverified | 0 |
| MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding | Jun 10, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding | Jun 9, 2025 | Contrastive Learning, Video Editing | —Unverified | 0 |
| SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding | Jun 9, 2025 | RAG, Retrieval | —Unverified | 0 |
| SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis | Jun 9, 2025 | Action Classification, Benchmarking | —Unverified | 0 |
| Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision | Jun 6, 2025 | Video Understanding | Code Available | 0 |
| Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models | Jun 6, 2025 | Segmentation, Video Understanding | —Unverified | 0 |
| DualX-VSR: Dual Axial Spatial×Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation | Jun 5, 2025 | Motion Compensation, Optical Flow Estimation | —Unverified | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | Jun 5, 2025 | Benchmarking, Video Understanding | —Unverified | 0 |
| APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval | Jun 5, 2025 | Information Retrieval, Retrieval | —Unverified | 0 |
| TextVidBench: A Benchmark for Long Video Scene Text Understanding | Jun 5, 2025 | Prompt Engineering, Question Answering | —Unverified | 0 |
| DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding | Jun 4, 2025 | MME, Video MME | —Unverified | 0 |
| METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding | Jun 3, 2025 | Video Understanding | Code Available | 0 |
| InterRVOS: Interaction-aware Referring Video Object Segmentation | Jun 3, 2025 | 8k, Object | —Unverified | 0 |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | Jun 3, 2025 | EgoSchema, Question Answering | Code Available | 0 |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | Jun 2, 2025 | Action Recognition, Video Understanding | —Unverified | 0 |
| FlexSelect: Flexible Token Selection for Efficient Long Video Understanding | Jun 1, 2025 | Video Understanding | —Unverified | 0 |
| Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis | May 31, 2025 | Scene Segmentation, Segmentation | —Unverified | 0 |
| Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders | May 30, 2025 | Video Understanding | —Unverified | 0 |
| Learning reusable concepts across different egocentric video understanding tasks | May 30, 2025 | Video Understanding | —Unverified | 0 |
| VUDG: A Dataset for Video Understanding Domain Generalization | May 30, 2025 | Domain Generalization, Multiple-choice | —Unverified | 0 |
| Time Blindness: Why Video-Language Models Can't See What Humans Can? | May 30, 2025 | Temporal Sequences, Video Understanding | —Unverified | 0 |
| ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding | May 29, 2025 | Avg, Video Understanding | Code Available | 0 |
| MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection | May 29, 2025 | image-classification, Image Classification | —Unverified | 0 |
| Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding | May 29, 2025 | RAG, Retrieval-augmented Generation | —Unverified | 0 |
| Universal Visuo-Tactile Video Understanding for Embodied Interaction | May 28, 2025 | Friction, Large Language Model | —Unverified | 0 |
| Two Causally Related Needles in a Video Haystack | May 26, 2025 | Video Understanding, Visual Grounding | —Unverified | 0 |
| TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos | May 26, 2025 | Attribute, Video Understanding | Code Available | 0 |
| AdaTP: Attention-Debiased Token Pruning for Video Large Language Models | May 26, 2025 | Video Understanding | —Unverified | 0 |
| Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs | May 25, 2025 | Video Understanding | —Unverified | 0 |
| Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding | May 23, 2025 | Form, Question Answering | —Unverified | 0 |
| SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding | May 22, 2025 | Action Classification, Automatic Speech Recognition | Code Available | 0 |