SOTAVerified

Video Understanding

A crucial task in video understanding is to recognise and localise (in space and time) the different actions or events appearing in a video.

Source: Action Detection from a Robot-Car Perspective
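The task described above can be sketched with a minimal data structure: an action instance carries a label, a temporal extent, and per-frame bounding boxes, and detections are typically matched to ground truth by temporal overlap. This is an illustrative sketch only; the class and function names are hypothetical and not taken from any listed paper or benchmark.

```python
from dataclasses import dataclass, field

@dataclass
class ActionDetection:
    """One action instance localised in space and time (illustrative)."""
    label: str                      # e.g. "open door"
    t_start: float                  # start time in seconds
    t_end: float                    # end time in seconds
    # per-frame spatial boxes: frame index -> (x1, y1, x2, y2) in pixels
    boxes: dict = field(default_factory=dict)

def temporal_iou(a: ActionDetection, b: ActionDetection) -> float:
    """Temporal intersection-over-union between two detections."""
    inter = max(0.0, min(a.t_end, b.t_end) - max(a.t_start, b.t_start))
    union = (a.t_end - a.t_start) + (b.t_end - b.t_start) - inter
    return inter / union if union > 0 else 0.0

pred = ActionDetection("run", 2.0, 6.0, {50: (10, 20, 110, 220)})
gt = ActionDetection("run", 3.0, 7.0)
print(round(temporal_iou(pred, gt), 2))  # intersection 3s / union 5s -> 0.6
```

Spatio-temporal benchmarks in the list below (e.g. AVA, MultiSports) additionally score the spatial boxes per frame; the temporal-IoU matching shown here is the part shared with purely temporal action detection.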

Papers

Showing 151–200 of 1,149 papers

Title (every entry below has Status: Code and Hype: 1)

Occluded Video Instance Segmentation: A Benchmark
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding
Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions
Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
A Multigrid Method for Efficiently Training Video Models
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
IntentVizor: Towards Generic Query Guided Interactive Video Summarization
Is Appearance Free Action Recognition Possible?
BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding
∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation
AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding
Multimodal Distillation for Egocentric Action Recognition
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
Multimodal Long Video Modeling Based on Temporal Dynamic Context
AutoVideo: An Automated Video Action Recognition System
How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
Large Scale Holistic Video Understanding
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation
MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing
Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives
Learning Video Context as Interleaved Multimodal Sequences
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization
Language Repository for Long Video Understanding
Agentic Keyframe Search for Video Question Answering
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning
MotionSqueeze: Neural Motion Feature Learning for Video Understanding
MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions
Crossover Learning for Fast Online Video Instance Segmentation
From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities
CyberV: Cybernetics for Test-time Scaling in Video Understanding
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Page 4 of 23

No leaderboard results yet.