SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 151–175 of 1149 papers

Title | Status | Hype
MMAD: Multi-label Micro-Action Detection in Videos | Code | 1
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding | Code | 1
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Code | 1
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Code | 1
MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer | Code | 1
How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Code | 1
Large Scale Holistic Video Understanding | Code | 1
A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions | Code | 1
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Code | 1
A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector | Code | 1
MM-VID: Advancing Video Understanding with GPT-4V(ision) | Code | 1
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization | Code | 1
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Code | 1
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Code | 1
A Multigrid Method for Efficiently Training Video Models | Code | 1
Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Code | 1
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Code | 1
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Code | 1
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Code | 1
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps | Code | 1
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Code | 1
BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection | Code | 1
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Code | 1
AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation | Code | 1
Page 7 of 46

No leaderboard results yet.