SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 1-50 of 1149 papers

Title | Status | Hype
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding | - | 0
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks | Code | 1
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI | - | 0
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | - | 0
Omni-Video: Democratizing Unified Video Understanding and Generation | Code | 2
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Code | 1
Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models | - | 0
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation | - | 0
Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges | - | 0
Kwai Keye-VL Technical Report | Code | 4
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Code | 7
CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs | - | 0
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Code | 3
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment | Code | 0
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs | - | 0
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs | Code | 2
IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes | - | 0
Task-Aware KV Compression For Cost-Effective Long Video Understanding | Code | 0
PEVLM: Parallel Encoding for Vision-Language Models | - | 0
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning | - | 0
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Code | 2
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding | - | 0
EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization | Code | 0
MambaMia: A State-Space-Model-Based Compression for Efficient Video Understanding in Large Multimodal Models | - | 0
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding | Code | 0
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation | Code | 1
Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation | Code | 1
VideoDeepResearch: Long Video Understanding With Agentic Tool Using | Code | 2
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios | Code | 0
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | - | 0
MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding | - | 0
SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding | - | 0
CyberV: Cybernetics for Test-time Scaling in Video Understanding | Code | 1
Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding | - | 0
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis | - | 0
Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models | - | 0
Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision | Code | 0
TextVidBench: A Benchmark for Long Video Scene Text Understanding | - | 0
APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval | - | 0
DualX-VSR: Dual Axial SpatialTemporal Transformer for Real-World Video Super-Resolution without Motion Compensation | - | 0
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | - | 0
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding | - | 0
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding | Code | 0
EgoVLM: Policy Optimization for Egocentric Video Understanding | Code | 0
InterRVOS: Interaction-aware Referring Video Object Segmentation | - | 0
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | - | 0
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | Code | 2
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding | - | 0
Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis | - | 0
SiLVR: A Simple Language-based Video Reasoning Framework | Code | 1

No leaderboard results yet.