| VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Apr 25, 2025 | Caption Generation, EgoSchema | Code Available | 1 |
| TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | Apr 24, 2025 | MME, Video MME | Code Available | 3 |
| TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | Apr 24, 2025 | Caption Generation, Dense Video Captioning | Unverified | 0 |
| DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs | Apr 23, 2025 | Token Reduction, Video Understanding | Unverified | 0 |
| IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Apr 21, 2025 | Video Understanding | Code Available | 1 |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | Apr 21, 2025 | MME, Video MME | Code Available | 4 |
| An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | Apr 21, 2025 | MME, Video MME | Unverified | 0 |
| Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection | Apr 20, 2025 | Action Detection, Decoder | Unverified | 0 |
| OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding | Apr 20, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task | Apr 20, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 |
| How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos? | Apr 19, 2025 | Video Understanding | Unverified | 0 |
| VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models | Apr 17, 2025 | Hallucination, Video Understanding | Code Available | 1 |
| Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval | Apr 17, 2025 | Partially Relevant Video Retrieval, Retrieval | Unverified | 0 |
| Perception Encoder: The best visual embeddings are not at the output of the network | Apr 17, 2025 | Depth Estimation, Language Modeling | Code Available | 8 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | Apr 17, 2025 | Video Question Answering, Video Understanding | Code Available | 7 |
| Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | Apr 16, 2025 | Hallucination, Question Answering | Unverified | 0 |
| OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding | Apr 15, 2025 | Semantic Segmentation, Video Generation | Unverified | 0 |
| PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild | Apr 15, 2025 | Segmentation, Semantic Segmentation | Unverified | 0 |
| Mavors: Multi-granularity Video Representation for Multimodal Large Language Model | Apr 14, 2025 | Computational Efficiency, Language Modeling | Unverified | 0 |
| Multimodal Long Video Modeling Based on Temporal Dynamic Context | Apr 14, 2025 | Video Understanding | Code Available | 1 |
| TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | Apr 13, 2025 | Question Answering, reinforcement-learning | Code Available | 2 |
| F^3Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos | Apr 11, 2025 | Action Understanding, Event Detection | Code Available | 1 |
| Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking | Apr 11, 2025 | Moment Retrieval, Question Answering | Unverified | 0 |
| How Can Objects Help Video-Language Understanding? | Apr 10, 2025 | Image Captioning, Object | Unverified | 0 |
| VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding | Apr 10, 2025 | Instruction Following, Video Understanding | Unverified | 0 |
| SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding | Apr 10, 2025 | Video Understanding | Unverified | 0 |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | Apr 9, 2025 | MVBench, Object Tracking | Code Available | 3 |
| From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | Apr 8, 2025 | In-Context Learning, Instruction Following | Unverified | 0 |
| From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction | Apr 8, 2025 | Game State Reconstruction, Jersey Number Recognition | Unverified | 0 |
| InstructionBench: An Instructional Video Understanding Benchmark | Apr 7, 2025 | Common Sense Reasoning, Multiple-choice | Unverified | 0 |
| Re-thinking Temporal Search for Long-Form Video Understanding | Apr 3, 2025 | Computational Efficiency, Form | Code Available | 2 |
| Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | Apr 3, 2025 | Information Retrieval, Representation Learning | Unverified | 0 |
| Moment Quantization for Video Temporal Grounding | Apr 3, 2025 | Quantization, Video Understanding | Unverified | 0 |
| Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation | Apr 3, 2025 | Computational Efficiency, GPU | Code Available | 2 |
| Aligned Better, Listen Better for Audio-Visual Large Language Models | Apr 2, 2025 | Video Understanding | Unverified | 0 |
| Is Temporal Prompting All We Need For Limited Labeled Action Recognition? | Apr 2, 2025 | Action Recognition, All | Unverified | 0 |
| TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding | Apr 2, 2025 | Video Understanding | Unverified | 0 |
| SpaceR: Reinforcing MLLMs in Video Spatial Reasoning | Apr 2, 2025 | MME, Spatial Reasoning | Code Available | 2 |
| Slow-Fast Architecture for Video Multi-Modal Large Language Models | Apr 2, 2025 | Video Understanding | Code Available | 1 |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Mar 31, 2025 | Logical Reasoning, Multiple-choice | Code Available | 2 |
| H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding | Mar 31, 2025 | Video Understanding | Unverified | 0 |
| DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description | Mar 31, 2025 | Video Description, Video Understanding | Unverified | 0 |
| CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | Mar 30, 2025 | Action Classification, Action Recognition | Unverified | 0 |
| OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | Mar 29, 2025 | Streaming video understanding, Video Understanding | Unverified | 0 |
| BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding | Mar 27, 2025 | Form, Language Modeling | Code Available | 1 |
| Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model | Mar 27, 2025 | EgoSchema, Language Modeling | Code Available | 2 |
| From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment | Mar 26, 2025 | Video Understanding | Unverified | 0 |
| Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding | Mar 26, 2025 | GPU, Question Answering | Unverified | 0 |
| Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations | Mar 25, 2025 | Representation Learning, Video Understanding | Code Available | 0 |