| Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles | May 22, 2025 | EgoSchema, Few-Shot Learning | Unverified | 0 |
| ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | May 21, 2025 | Decision Making, Language Modeling | Code Available | 0 |
| Leveraging Foundation Models for Multimodal Graph-Based Action Recognition | May 21, 2025 | Action Recognition, Graph Attention | Unverified | 0 |
| ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning | May 21, 2025 | Pseudo Label, Reinforcement Learning (RL) | Unverified | 0 |
| LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | May 21, 2025 | Autonomous Driving, Question Answering | Unverified | 0 |
| Clapper: Compact Learning and Video Representation in VLMs | May 21, 2025 | Video Understanding | Unverified | 0 |
| Domain Adaptation of VLM for Soccer Video Understanding | May 20, 2025 | Action Classification, Domain Adaptation | Unverified | 0 |
| A Challenge to Build Neuro-Symbolic Video Agents | May 20, 2025 | Scene Classification, Video Retrieval | Code Available | 0 |
| Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? | May 20, 2025 | Video Understanding | Unverified | 0 |
| Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding | May 19, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations | May 18, 2025 | Video Editing, Video Understanding | Unverified | 0 |
| SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation | May 13, 2025 | Computational Efficiency, Video Understanding | Unverified | 0 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | May 13, 2025 | Form, Multiple-choice | Code Available | 0 |
| Gameplay Highlights Generation | May 12, 2025 | Event Detection, Highlight Detection | Unverified | 0 |
| Seed1.5-VL Technical Report | May 11, 2025 | Mixture-of-Experts, Multimodal Reasoning | Unverified | 0 |
| StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant | May 8, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph | May 6, 2025 | EgoSchema, Retrieval | Unverified | 0 |
| VideoLLM Benchmarks and Evaluation: A Survey | May 3, 2025 | Survey, Video Understanding | Unverified | 0 |
| Empowering Agentic Video Analytics Systems with Video Language Models | May 1, 2025 | Knowledge Graphs, RAG | Unverified | 0 |
| SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding | Apr 30, 2025 | Video Understanding | Code Available | 0 |
| TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | Apr 24, 2025 | Caption Generation, Dense Video Captioning | Unverified | 0 |
| DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs | Apr 23, 2025 | Token Reduction, Video Understanding | Unverified | 0 |
| An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | Apr 21, 2025 | MME, Video MME | Unverified | 0 |
| Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection | Apr 20, 2025 | Action Detection, Decoder | Unverified | 0 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 |
| ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task | Apr 20, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding | Apr 20, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos? | Apr 19, 2025 | Video Understanding | Unverified | 0 |
| Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval | Apr 17, 2025 | Partially Relevant Video Retrieval, Retrieval | Unverified | 0 |
| Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | Apr 16, 2025 | Hallucination, Question Answering | Unverified | 0 |
| PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild | Apr 15, 2025 | Segmentation, Semantic Segmentation | Unverified | 0 |
| OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding | Apr 15, 2025 | Semantic Segmentation, Video Generation | Unverified | 0 |
| Mavors: Multi-granularity Video Representation for Multimodal Large Language Model | Apr 14, 2025 | Computational Efficiency, Language Modeling | Unverified | 0 |
| Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking | Apr 11, 2025 | Moment Retrieval, Question Answering | Unverified | 0 |
| SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding | Apr 10, 2025 | Video Understanding | Unverified | 0 |
| How Can Objects Help Video-Language Understanding? | Apr 10, 2025 | Image Captioning, Object | Unverified | 0 |
| VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding | Apr 10, 2025 | Instruction Following, Video Understanding | Unverified | 0 |
| From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction | Apr 8, 2025 | Game State Reconstruction, Jersey Number Recognition | Unverified | 0 |
| From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | Apr 8, 2025 | In-Context Learning, Instruction Following | Unverified | 0 |
| InstructionBench: An Instructional Video Understanding Benchmark | Apr 7, 2025 | Common Sense Reasoning, Multiple-choice | Unverified | 0 |
| Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | Apr 3, 2025 | Information Retrieval, Representation Learning | Unverified | 0 |
| Moment Quantization for Video Temporal Grounding | Apr 3, 2025 | Quantization, Video Understanding | Unverified | 0 |
| TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding | Apr 2, 2025 | Video Understanding | Unverified | 0 |
| Is Temporal Prompting All We Need For Limited Labeled Action Recognition? | Apr 2, 2025 | Action Recognition, All | Unverified | 0 |
| Aligned Better, Listen Better for Audio-Visual Large Language Models | Apr 2, 2025 | Video Understanding | Unverified | 0 |
| DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description | Mar 31, 2025 | Video Description, Video Understanding | Unverified | 0 |
| H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding | Mar 31, 2025 | Video Understanding | Unverified | 0 |
| CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | Mar 30, 2025 | Action Classification, Action Recognition | Unverified | 0 |
| OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | Mar 29, 2025 | Streaming video understanding, Video Understanding | Unverified | 0 |
| Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding | Mar 26, 2025 | GPU, Question Answering | Unverified | 0 |