SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 301–350 of 1149 papers

Title | Status | Hype
EPIC Fields: Marrying 3D Geometry and Video Understanding | Code | 1
Technical Report: Temporal Aggregate Representations | Code | 1
Long Movie Clip Classification with State-Space Video Models | Code | 1
T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs | Code | 1
CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction | Code | 1
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation | Code | 1
Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection | Code | 1
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation | Code | 1
Learning Temporally Latent Causal Processes from General Temporal Data | Code | 1
Learning the Predictability of the Future | Code | 1
Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention | Code | 1
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Code | 1
Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization | Code | 1
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Code | 1
F^3Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos | Code | 1
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness | Code | 1
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | Code | 1
Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Code | 1
Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos | Code | 1
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Code | 1
Test of Time: Instilling Video-Language Models with a Sense of Time | Code | 1
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Code | 1
IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Code | 1
∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Code | 1
End-to-End Video Instance Segmentation with Transformers | Code | 1
Streaming Video Temporal Action Segmentation In Real Time | Code | 1
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Code | 1
End-to-end Temporal Action Detection with Transformer | Code | 1
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
End-to-End Streaming Video Temporal Action Segmentation with Reinforce Learning | Code | 1
FineAction: A Fine-Grained Video Dataset for Temporal Action Localization | Code | 1
Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Code | 1
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Code | 1
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Code | 1
End-to-End Referring Video Object Segmentation with Multimodal Transformers | Code | 1
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Code | 1
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges | Code | 1
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Code | 1
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Code | 1
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding | Code | 1
MMAD: Multi-label Micro-Action Detection in Videos | Code | 1
How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Code | 1
Compositional Video Understanding with Spatiotemporal Structure-based Transformers | Code | 1
An overview on the evaluated video retrieval tasks at TRECVID 2022 | Code | 1
Stochastic Image-to-Video Synthesis using cINNs | Code | 1
Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Code | 1
Large Scale Holistic Video Understanding | Code | 1
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | Code | 1
FrameExit: Conditional Early Exiting for Efficient Video Recognition | Code | 1
A Comprehensive Study of Deep Video Action Recognition | Code | 1

No leaderboard results yet.