| Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Jun 28, 2025 | Image Segmentation, Large Language Model | Code Available | 1 |
| LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs | Jun 27, 2025 | Question Answering, Video Question Answering | Code Available | 2 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | Jun 19, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Jun 18, 2025 | Audio captioning, Large Language Model | Code Available | 2 |
| CogStream: Context-guided Streaming Video Question Answering | Jun 12, 2025 | Question Answering, Video Question Answering | Unverified | 0 |
| V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | Jun 11, 2025 | Action Anticipation, Large Language Model | Code Available | 7 |
| CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models | Jun 11, 2025 | counterfactual, Descriptive | Code Available | 2 |
| Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning | Jun 9, 2025 | Future prediction, Question Answering | Code Available | 0 |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | Jun 3, 2025 | EgoSchema, Question Answering | Code Available | 0 |
| VUDG: A Dataset for Video Understanding Domain Generalization | May 30, 2025 | Domain Generalization, Multiple-choice | Unverified | 0 |
| Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering | May 30, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos | May 29, 2025 | Question Answering, Video Generation | Code Available | 0 |
| ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | May 21, 2025 | Decision Making, Language Modeling | Code Available | 0 |
| LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | May 21, 2025 | Autonomous Driving, Question Answering | Unverified | 0 |
| SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models | May 19, 2025 | Causal Inference, Decision Making | Unverified | 0 |
| Understanding Complexity in VideoQA via Visual Program Generation | May 19, 2025 | Code Generation, Question Answering | Unverified | 0 |
| Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models | May 16, 2025 | Image Captioning, Question Answering | Code Available | 0 |
| Overview of the NLPCC 2025 Shared Task 4: Multi-modal, Multilingual, and Multi-hop Medical Instructional Video Question Answering Challenge | May 11, 2025 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| Seed1.5-VL Technical Report | May 11, 2025 | Mixture-of-Experts, Multimodal Reasoning | Unverified | 0 |
| VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Apr 25, 2025 | Caption Generation, EgoSchema | Code Available | 1 |
| Towards Understanding Camera Motions in Any Video | Apr 21, 2025 | Question Answering, Text Retrieval | Unverified | 0 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | Apr 17, 2025 | Video Question Answering, Video Understanding | Code Available | 7 |
| Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | Apr 16, 2025 | Hallucination, Question Answering | Unverified | 0 |
| How Can Objects Help Video-Language Understanding? | Apr 10, 2025 | Image Captioning, Object | Unverified | 0 |
| Advancing Egocentric Video Question Answering with Multimodal Large Language Models | Apr 6, 2025 | Object Recognition, Question Answering | Unverified | 0 |