SOTAVerified

Video Understanding

A central task in video understanding is to recognise and localise, in space and time, the actions or events that appear in a video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 101–125 of 1149 papers

Title | Status | Hype
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Code | 1
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | Code | 3
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | | 0
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs | | 0
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Code | 1
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | Code | 4
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | | 0
ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task | | 0
Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection | | 0
Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Code | 0
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding | | 0
How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos? | | 0
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | Code | 7
Perception Encoder: The best visual embeddings are not at the output of the network | Code | 8
Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval | | 0
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models | Code | 1
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | | 0
OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding | | 0
PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild | | 0
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model | | 0
Multimodal Long Video Modeling Based on Temporal Dynamic Context | Code | 1
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | Code | 2
F^3Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos | Code | 1
Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking | | 0
How Can Objects Help Video-Language Understanding? | | 0

No leaderboard results yet.