SOTA Verified

Video Understanding

A crucial task in video understanding is to recognise and localise, in both space and time, the different actions or events appearing in a video.

Source: Action Detection from a Robot-Car Perspective
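Concretely, spatio-temporal action detection systems typically report "action tubes": an action label, a temporal interval, and a per-frame bounding box. The sketch below is a minimal, illustrative data structure for such a result; the class and field names are assumptions for illustration, not taken from any specific benchmark or paper in the list.

```python
from dataclasses import dataclass, field

@dataclass
class ActionTube:
    """One detected action: what it is, when it happens, and where in each frame."""
    label: str                      # action class, e.g. "crossing road"
    start_frame: int                # temporal localisation (inclusive)
    end_frame: int                  # temporal localisation (inclusive)
    # Spatial localisation: frame index -> (x1, y1, x2, y2) box in pixels.
    boxes: dict = field(default_factory=dict)
    score: float = 0.0              # detector confidence in [0, 1]

    def duration(self) -> int:
        """Number of frames the action spans."""
        return self.end_frame - self.start_frame + 1

# Hypothetical detection spanning three frames, with one box per frame.
tube = ActionTube(
    "crossing road", start_frame=10, end_frame=12,
    boxes={10: (5, 5, 50, 90), 11: (7, 5, 52, 90), 12: (9, 5, 54, 90)},
    score=0.87,
)
```

Benchmarks then score such tubes by combining temporal overlap (on the frame interval) with spatial overlap (IoU of the per-frame boxes) against ground-truth annotations.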

Papers

Showing 201-250 of 1149 papers

Title | Status | Hype
BEARCUBS: A benchmark for computer-using web agents |  | 0
ALLVB: All-in-One Long Video Understanding Benchmark |  | 0
Towards Fine-Grained Video Question Answering |  | 0
TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos | Code | 1
Unified Reward Model for Multimodal Understanding and Generation | Code | 4
EgoLife: Towards Egocentric Life Assistant | Code | 3
Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection |  | 0
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | Code | 1
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models |  | 0
PreMind: Multi-Agent Video Understanding for Advanced Indexing of Presentation-style Videos |  | 0
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection | Code | 3
M-LLM Based Video Frame Selection for Efficient Video Understanding |  | 0
InternVQA: Advancing Compressed Video Quality Assessment with Distilling Large Foundation Model |  | 0
An Analysis of Data Transformation Effects on Segment Anything 2 |  | 0
Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos | Code | 1
Fine-Grained Video Captioning through Scene Graph Consolidation |  | 0
LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models |  | 0
AVD2: Accident Video Diffusion for Accident Video Description |  | 0
MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval |  | 0
iMOVE: Instance-Motion-Aware Video Understanding |  | 0
VRoPE: Rotary Position Embedding for Video Large Language Models | Code | 1
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | Code | 1
Semantics-aware Test-time Adaptation for 3D Human Pose Estimation |  | 0
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | Code | 2
Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering |  | 0
Enhancing Video Understanding: Deep Neural Networks for Spatiotemporal Analysis |  | 0
A Survey on Mamba Architecture for Vision Applications |  | 0
CoS: Chain-of-Shot Prompting for Long Video Understanding |  | 0
A Survey on Video Analytics in Cloud-Edge-Terminal Collaborative Systems |  | 0
VideoRoPE: What Makes for Good Video Rotary Position Embedding? | Code | 3
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | Code | 3
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs |  | 0
MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding |  | 0
A Decade of Action Quality Assessment: Largest Systematic Survey of Trends, Challenges, and Future Directions |  | 0
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Code | 1
Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Code | 1
LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models |  | 0
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos | Code | 7
AIN: The Arabic INclusive Large Multimodal Model | Code | 2
∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Code | 1
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding |  | 0
Understanding Long Videos via LLM-Powered Entity Relation Graphs |  | 0
TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding | Code | 2
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding |  | 0
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | Code | 2
Temporal Preference Optimization for Long-Form Video Understanding |  | 0
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding | Code | 5
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | Code | 0
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | Code | 0
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding | Code | 2
Page 5 of 23

No leaderboard results yet.