SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 101–125 of 1149 papers

Title | Status | Hype
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Code | 2
Foundation Models for Video Understanding: A Survey | Code | 2
AIN: The Arabic INclusive Large Multimodal Model | Code | 2
OmniVid: A Generative Framework for Universal Video Understanding | Code | 2
ST-LLM: Large Language Models Are Effective Temporal Learners | Code | 2
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | Code | 2
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future | Code | 2
Omni-Video: Democratizing Unified Video Understanding and Generation | Code | 2
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Code | 2
Temporal Action Segmentation: An Analysis of Modern Techniques | Code | 2
Neptune: The Long Orbit to Benchmarking Long Video Understanding | Code | 2
TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding | Code | 2
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Code | 2
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models | Code | 2
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Code | 2
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer | Code | 2
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | Code | 2
Beyond MOT: Semantic Multi-Object Tracking | Code | 2
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | Code | 2
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI | Code | 2
PyTorchVideo: A Deep Learning Library for Video Understanding | Code | 2
Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos | Code | 2
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding | Code | 2
A Content-Driven Micro-Video Recommendation Dataset at Scale | Code | 2
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model | Code | 2

No leaderboard results yet.