| VUDG: A Dataset for Video Understanding Domain Generalization | May 30, 2025 | Domain Generalization, Multiple-choice | Unverified | 0 |
| Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders | May 30, 2025 | Video Understanding | Unverified | 0 |
| SiLVR: A Simple Language-based Video Reasoning Framework | May 30, 2025 | Math, MME | Code Available | 1 |
| Learning reusable concepts across different egocentric video understanding tasks | May 30, 2025 | Video Understanding | Unverified | 0 |
| VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software | May 30, 2025 | Question Answering, Spatial Reasoning | Code Available | 1 |
| Time Blindness: Why Video-Language Models Can't See What Humans Can? | May 30, 2025 | Temporal Sequences, Video Understanding | Unverified | 0 |
| Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding | May 29, 2025 | RAG, Retrieval-augmented Generation | Unverified | 0 |
| MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection | May 29, 2025 | image-classification, Image Classification | Unverified | 0 |
| ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding | May 29, 2025 | Avg, Video Understanding | Code Available | 0 |
| VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models | May 29, 2025 | Self-Supervised Learning, Video Generation | Code Available | 2 |
| VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? | May 29, 2025 | Video Understanding | Code Available | 1 |
| PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling | May 29, 2025 | Video Understanding | Code Available | 1 |
| One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | May 29, 2025 | Contrastive Learning, Text Retrieval | Code Available | 2 |
| Universal Visuo-Tactile Video Understanding for Embodied Interaction | May 28, 2025 | Friction, Large Language Model | Unverified | 0 |
| VidText: Towards Comprehensive Evaluation for Video Text Understanding | May 28, 2025 | Multimodal Reasoning, Optical Character Recognition (OCR) | Code Available | 1 |
| MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | May 27, 2025 | Reinforcement Learning (RL), Video Understanding | Code Available | 1 |
| Two Causally Related Needles in a Video Haystack | May 26, 2025 | Video Understanding, Visual Grounding | Unverified | 0 |
| AdaTP: Attention-Debiased Token Pruning for Video Large Language Models | May 26, 2025 | Video Understanding | Unverified | 0 |
| TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos | May 26, 2025 | Attribute, Video Understanding | Code Available | 0 |
| Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs | May 25, 2025 | Video Understanding | Unverified | 0 |
| Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding | May 23, 2025 | Form, Question Answering | Unverified | 0 |
| SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding | May 22, 2025 | Action Classification, Automatic Speech Recognition | Code Available | 0 |
| Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles | May 22, 2025 | EgoSchema, Few-Shot Learning | Unverified | 0 |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | May 22, 2025 | Misinformation, reinforcement-learning | Code Available | 1 |
| QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design | May 22, 2025 | CPU, GPU | Code Available | 2 |
| ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | May 21, 2025 | Decision Making, Language Modeling | Code Available | 0 |
| Clapper: Compact Learning and Video Representation in VLMs | May 21, 2025 | Video Understanding | Unverified | 0 |
| ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning | May 21, 2025 | Pseudo Label, Reinforcement Learning (RL) | Unverified | 0 |
| LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | May 21, 2025 | Autonomous Driving, Question Answering | Unverified | 0 |
| Leveraging Foundation Models for Multimodal Graph-Based Action Recognition | May 21, 2025 | Action Recognition, Graph Attention | Unverified | 0 |
| A Challenge to Build Neuro-Symbolic Video Agents | May 20, 2025 | Scene Classification, Video Retrieval | Code Available | 0 |
| Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models | May 20, 2025 | Video Compression, Video Understanding | Code Available | 2 |
| Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? | May 20, 2025 | Video Understanding | Unverified | 0 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption Generation, Retrieval | Code Available | 1 |
| VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | May 20, 2025 | MME, Multiple-choice | Code Available | 4 |
| Domain Adaptation of VLM for Soccer Video Understanding | May 20, 2025 | Action Classification, Domain Adaptation | Unverified | 0 |
| Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding | May 19, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations | May 18, 2025 | Video Editing, Video Understanding | Unverified | 0 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | May 13, 2025 | Form, Multiple-choice | Code Available | 0 |
| SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation | May 13, 2025 | Computational Efficiency, Video Understanding | Unverified | 0 |
| Gameplay Highlights Generation | May 12, 2025 | Event Detection, Highlight Detection | Unverified | 0 |
| Seed1.5-VL Technical Report | May 11, 2025 | Mixture-of-Experts, Multimodal Reasoning | Unverified | 0 |
| StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant | May 8, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph | May 6, 2025 | EgoSchema, Retrieval | Unverified | 0 |
| Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection | May 5, 2025 | Anomaly Detection, Anomaly Detection In Surveillance Videos | Code Available | 1 |
| VideoLLM Benchmarks and Evaluation: A Survey | May 3, 2025 | Survey, Video Understanding | Unverified | 0 |
| VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding | May 2, 2025 | Anomaly Detection, Common Sense Reasoning | Code Available | 1 |
| TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | May 2, 2025 | Dense Captioning, Highlight Detection | Code Available | 1 |
| Empowering Agentic Video Analytics Systems with Video Language Models | May 1, 2025 | Knowledge Graphs, RAG | Unverified | 0 |
| SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding | Apr 30, 2025 | Video Understanding | Code Available | 0 |