SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 26-50 of 1149 papers

Title | Status | Hype
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation | Code | 1
Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation | Code | 1
VideoDeepResearch: Long Video Understanding With Agentic Tool Using | Code | 2
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios | Code | 0
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | - | 0
MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding | - | 0
SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding | - | 0
CyberV: Cybernetics for Test-time Scaling in Video Understanding | Code | 1
Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding | - | 0
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis | - | 0
Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models | - | 0
Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision | Code | 0
TextVidBench: A Benchmark for Long Video Scene Text Understanding | - | 0
APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval | - | 0
DualX-VSR: Dual Axial Spatial×Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation | - | 0
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | - | 0
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding | - | 0
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding | Code | 0
EgoVLM: Policy Optimization for Egocentric Video Understanding | Code | 0
InterRVOS: Interaction-aware Referring Video Object Segmentation | - | 0
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | - | 0
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | Code | 2
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding | - | 0
Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis | - | 0
SiLVR: A Simple Language-based Video Reasoning Framework | Code | 1

No leaderboard results yet.