SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 451–475 of 1149 papers

Title | Status | Hype
VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model | Code | 0
Rethinking Image-to-Video Adaptation: An Object-centric Perspective | | 0
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision | Code | 2
MMAD: Multi-label Micro-Action Detection in Videos | Code | 1
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding | | 0
KeyVideoLLM: Towards Large-scale Video Keyframe Selection | | 0
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | | 0
Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs | | 0
https://arxiv.org/abs/2407.00634 | Code | 0
Tarsier: Recipes for Training and Evaluating Large Video Description Models | Code | 4
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | Code | 5
Snakes and Ladders: Two Steps Up for VideoMamba | Code | 1
Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads | Code | 1
Zero-Shot Long-Form Video Understanding through Screenplay | | 0
PVUW 2024 Challenge on Complex Video Understanding: Methods and Results | Code | 4
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | | 0
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer | Code | 2
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | Code | 0
Towards Event-oriented Long Video Understanding | Code | 1
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding | | 0
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset | | 0
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Code | 1
GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement | | 0
DrVideo: Document Retrieval Based Long Video Understanding | | 0

No leaderboard results yet.