SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 151–175 of 1149 papers

Title | Status | Hype
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding | Code | 1
A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector | Code | 1
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Code | 1
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Code | 1
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Code | 1
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Code | 1
A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions | Code | 1
MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps | Code | 1
MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer | Code | 1
MM-VID: Advancing Video Understanding with GPT-4V(ision) | Code | 1
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization | Code | 1
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation | Code | 1
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1
A Multigrid Method for Efficiently Training Video Models | Code | 1
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Code | 1
CyberV: Cybernetics for Test-time Scaling in Video Understanding | Code | 1
Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Code | 1
BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection | Code | 1
AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation | Code | 1
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Code | 1
Lightweight Network Architecture for Real-Time Action Recognition | Code | 1
Localizing Moments in Long Video Via Multimodal Guidance | Code | 1
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | Code | 1
Learning the Predictability of the Future | Code | 1
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions | Code | 1
Page 7 of 46

No leaderboard results yet.