SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 51–100 of 1149 papers

| Title | Status | Hype |
| --- | --- | --- |
| VUDG: A Dataset for Video Understanding Domain Generalization | | 0 |
| Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders | | 0 |
| SiLVR: A Simple Language-based Video Reasoning Framework | Code | 1 |
| Learning reusable concepts across different egocentric video understanding tasks | | 0 |
| VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software | Code | 1 |
| Time Blindness: Why Video-Language Models Can't See What Humans Can? | | 0 |
| Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding | | 0 |
| MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection | | 0 |
| ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding | Code | 0 |
| VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models | Code | 2 |
| VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? | Code | 1 |
| PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling | Code | 1 |
| One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | Code | 2 |
| Universal Visuo-Tactile Video Understanding for Embodied Interaction | | 0 |
| VidText: Towards Comprehensive Evaluation for Video Text Understanding | Code | 1 |
| MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | Code | 1 |
| Two Causally Related Needles in a Video Haystack | | 0 |
| AdaTP: Attention-Debiased Token Pruning for Video Large Language Models | | 0 |
| TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos | Code | 0 |
| Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs | | 0 |
| Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding | | 0 |
| SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding | Code | 0 |
| Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles | | 0 |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | Code | 1 |
| QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design | Code | 2 |
| ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | Code | 0 |
| Clapper: Compact Learning and Video Representation in VLMs | | 0 |
| ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning | | 0 |
| LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | | 0 |
| Leveraging Foundation Models for Multimodal Graph-Based Action Recognition | | 0 |
| A Challenge to Build Neuro-Symbolic Video Agents | Code | 0 |
| Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models | Code | 2 |
| Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? | | 0 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1 |
| VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | Code | 4 |
| Domain Adaptation of VLM for Soccer Video Understanding | | 0 |
| Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding | Code | 0 |
| From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations | | 0 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | Code | 0 |
| SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation | | 0 |
| Gameplay Highlights Generation | | 0 |
| Seed1.5-VL Technical Report | | 0 |
| StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant | | 0 |
| RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph | | 0 |
| Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection | Code | 1 |
| VideoLLM Benchmarks and Evaluation: A Survey | | 0 |
| VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding | Code | 1 |
| TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | Code | 1 |
| Empowering Agentic Video Analytics Systems with Video Language Models | | 0 |
| SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding | Code | 0 |

No leaderboard results yet.