SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 201–250 of 1149 papers

Title | Status | Hype
Action Scene Graphs for Long-Form Understanding of Egocentric Videos | Code | 1
Is Appearance Free Action Recognition Possible? | Code | 1
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos | Code | 1
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Code | 1
∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Code | 1
Agentic Keyframe Search for Video Question Answering | Code | 1
IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Code | 1
Leveraging triplet loss for unsupervised action segmentation | Code | 1
PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling | Code | 1
SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models | Code | 1
Crossover Learning for Fast Online Video Instance Segmentation | Code | 1
DEVIAS: Learning Disentangled Video Representations of Action and Scene | Code | 1
Panoptic Video Scene Graph Generation | Code | 1
Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos | Code | 1
How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Code | 1
Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Code | 1
AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation | Code | 1
Disentangle Your Dense Object Detector | Code | 1
Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition | Code | 1
DisTime: Distribution-based Time Representation for Video Large Language Models | Code | 1
BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection | Code | 1
Learning Temporally Causal Latent Processes from General Temporal Data | Code | 1
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Code | 1
Open-Vocabulary Video Relation Extraction | Code | 1
Panoramic Vision Transformer for Saliency Detection in 360° Videos | Code | 1
Do Language Models Understand Time? | Code | 1
Large Scale Holistic Video Understanding | Code | 1
Domain Knowledge-Informed Self-Supervised Representations for Workout Form Assessment | Code | 1
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Code | 1
Localizing Moments in Long Video Via Multimodal Guidance | Code | 1
Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation | Code | 1
Contrastive Masked Autoencoders for Self-Supervised Video Hashing | Code | 1
PAN: Towards Fast Action Recognition via Learning Persistence of Appearance | Code | 1
REVECA -- Rich Encoder-decoder framework for Video Event CAptioner | Code | 1
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Code | 1
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Code | 1
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Code | 1
Grounded Question-Answering in Long Egocentric Videos | Code | 1
A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions | Code | 1
Dual-path Adaptation from Image to Video Transformers | Code | 1
Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Code | 1
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1
Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Code | 1
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Code | 1
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | Code | 1
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos | Code | 1
From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities | Code | 1
Object-Region Video Transformers | Code | 1
Compositional Video Understanding with Spatiotemporal Structure-based Transformers | Code | 1
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | Code | 1

No leaderboard results yet.