SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 201-250 of 1149 papers

Title | Status | Hype
Learning Video Context as Interleaved Multimodal Sequences | Code | 1
MotionSqueeze: Neural Motion Feature Learning for Video Understanding | Code | 1
MM-VID: Advancing Video Understanding with GPT-4V(ision) | Code | 1
A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector | Code | 1
MMAD: Multi-label Micro-Action Detection in Videos | Code | 1
EgoTaskQA: Understanding Human Tasks in Egocentric Videos | Code | 1
Agentic Keyframe Search for Video Question Answering | Code | 1
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Code | 1
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | Code | 1
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding | Code | 1
Multimodal Long Video Modeling Based on Temporal Dynamic Context | Code | 1
DEVIAS: Learning Disentangled Video Representations of Action and Scene | Code | 1
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Code | 1
Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos | Code | 1
Panoramic Vision Transformer for Saliency Detection in 360° Videos | Code | 1
SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models | Code | 1
Crossover Learning for Fast Online Video Instance Segmentation | Code | 1
Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Code | 1
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1
DisTime: Distribution-based Time Representation for Video Large Language Models | Code | 1
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation | Code | 1
MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps | Code | 1
Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation | Code | 1
Contrastive Masked Autoencoders for Self-Supervised Video Hashing | Code | 1
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma? | Code | 1
Do Language Models Understand Time? | Code | 1
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Code | 1
Localizing Moments in Long Video Via Multimodal Guidance | Code | 1
Long Movie Clip Classification with State-Space Video Models | Code | 1
Lightweight Network Architecture for Real-Time Action Recognition | Code | 1
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | Code | 1
Learning the Predictability of the Future | Code | 1
Procedure-Aware Pretraining for Instructional Video Understanding | Code | 1
Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection | Code | 1
Learning Temporally Latent Causal Processes from General Temporal Data | Code | 1
Leveraging triplet loss for unsupervised action segmentation | Code | 1
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Code | 1
Event-Free Moving Object Segmentation from Moving Ego Vehicle | Code | 1
A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions | Code | 1
Dual-path Adaptation from Image to Video Transformers | Code | 1
Learning Optical Flow with Adaptive Graph Reasoning | Code | 1
Relational Self-Attention: What's Missing in Attention for Video Understanding | Code | 1
Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Code | 1
REVECA -- Rich Encoder-decoder framework for Video Event CAptioner | Code | 1
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Code | 1
Compositional Video Understanding with Spatiotemporal Structure-based Transformers | Code | 1
Language Repository for Long Video Understanding | Code | 1
Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition | Code | 1
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding | Code | 1
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Code | 1

No leaderboard results yet.