SOTAVerified

Video Understanding

A crucial task in video understanding is to recognise and localise, in space and time, the different actions or events appearing in a video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 451-500 of 1149 papers

Title | Status | Hype
VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model | Code | 0
Rethinking Image-to-Video Adaptation: An Object-centric Perspective | | 0
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision | Code | 2
MMAD: Multi-label Micro-Action Detection in Videos | Code | 1
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding | | 0
KeyVideoLLM: Towards Large-scale Video Keyframe Selection | | 0
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | | 0
Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs | | 0
https://arxiv.org/abs/2407.00634 | Code | 0
Tarsier: Recipes for Training and Evaluating Large Video Description Models | Code | 4
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
Snakes and Ladders: Two Steps Up for VideoMamba | Code | 1
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | Code | 5
Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads | Code | 1
Zero-Shot Long-Form Video Understanding through Screenplay | | 0
PVUW 2024 Challenge on Complex Video Understanding: Methods and Results | Code | 4
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | | 0
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer | Code | 2
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | Code | 0
Towards Event-oriented Long Video Understanding | Code | 1
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding | | 0
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset | | 0
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Code | 1
GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement | | 0
DrVideo: Document Retrieval Based Long Video Understanding | | 0
Slot State Space Models | Code | 1
Hallucination Mitigation Prompts Long-term Video Understanding | Code | 0
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning | Code | 1
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | | 0
Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model | Code | 0
Localizing Events in Videos with Multimodal Queries | | 0
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | | 0
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living | | 0
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Code | 2
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Code | 3
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Code | 3
LVBench: An Extreme Long Video Understanding Benchmark | Code | 2
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos | Code | 1
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models | | 0
MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD | | 0
Vript: A Video Is Worth Thousands of Words | Code | 2
1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation | | 0
Semantic Segmentation on VSPW Dataset through Masked Video Consistency | | 0
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | Code | 5
3rd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation | | 0
MLVU: Benchmarking Multi-task Long Video Understanding | Code | 3
Contrastive Language Video Time Pre-training | | 0
Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos | Code | 1
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | | 0
2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation | | 0
Page 10 of 23

No leaderboard results yet.