SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 151-200 of 1149 papers

Title | Status | Hype
Occluded Video Instance Segmentation: A Benchmark | Code | 1
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding | Code | 1
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Code | 1
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding | Code | 1
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma? | Code | 1
A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions | Code | 1
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | Code | 1
NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions | Code | 1
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization | Code | 1
Multimodal Distillation for Egocentric Action Recognition | Code | 1
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding | Code | 1
A Multigrid Method for Efficiently Training Video Models | Code | 1
Multimodal Long Video Modeling Based on Temporal Dynamic Context | Code | 1
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | Code | 1
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Code | 1
Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning | Code | 1
Do Language Models Understand Time? | Code | 1
MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing | Code | 1
MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions | Code | 1
BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection | Code | 1
MM-VID: Advancing Video Understanding with GPT-4V(ision) | Code | 1
AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation | Code | 1
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Code | 1
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Code | 1
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos | Code | 1
A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector | Code | 1
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions | Code | 1
MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer | Code | 1
MMAD: Multi-label Micro-Action Detection in Videos | Code | 1
AutoVideo: An Automated Video Action Recognition System | Code | 1
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Code | 1
Action Scene Graphs for Long-Form Understanding of Egocentric Videos | Code | 1
Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Code | 1
MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps | Code | 1
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Code | 1
CyberV: Cybernetics for Test-time Scaling in Video Understanding | Code | 1
Learning Video Context as Interleaved Multimodal Sequences | Code | 1
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation | Code | 1
Agentic Keyframe Search for Video Question Answering | Code | 1
Long Movie Clip Classification with State-Space Video Models | Code | 1
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Code | 1
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | Code | 1
Crossover Learning for Fast Online Video Instance Segmentation | Code | 1
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | Code | 1
Learning Temporally Latent Causal Processes from General Temporal Data | Code | 1
Learning Temporally Causal Latent Processes from General Temporal Data | Code | 1
Learning the Predictability of the Future | Code | 1
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1
Leveraging triplet loss for unsupervised action segmentation | Code | 1
Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Code | 1

No leaderboard results yet.