SOTAVerified

Video Understanding

A crucial task in Video Understanding is to recognise and localise (in space and time) the different actions or events appearing in a video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 226–250 of 1149 papers

Title | Status | Hype
Enhancing Video Understanding: Deep Neural Networks for Spatiotemporal Analysis | - | 0
A Survey on Mamba Architecture for Vision Applications | - | 0
CoS: Chain-of-Shot Prompting for Long Video Understanding | - | 0
A Survey on Video Analytics in Cloud-Edge-Terminal Collaborative Systems | - | 0
VideoRoPE: What Makes for Good Video Rotary Position Embedding? | Code | 3
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | Code | 3
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs | - | 0
MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding | - | 0
A Decade of Action Quality Assessment: Largest Systematic Survey of Trends, Challenges, and Future Directions | - | 0
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Code | 1
Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Code | 1
LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models | - | 0
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos | Code | 7
AIN: The Arabic INclusive Large Multimodal Model | Code | 2
∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Code | 1
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding | - | 0
Understanding Long Videos via LLM-Powered Entity Relation Graphs | - | 0
TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding | Code | 2
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding | - | 0
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | Code | 2
Temporal Preference Optimization for Long-Form Video Understanding | - | 0
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding | Code | 5
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | - | 0
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | - | 0
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding | Code | 2
Page 10 of 46

No leaderboard results yet.