SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 251-275 of 1149 papers

Title | Status | Hype
REVECA -- Rich Encoder-decoder framework for Video Event CAptioner | Code | 1
EEV: A Large-Scale Dataset for Studying Evoked Expressions from Video | Code | 1
Object-Region Video Transformers | Code | 1
Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties | Code | 1
SFMViT: SlowFast Meet ViT in Chaotic World | Code | 1
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Code | 1
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Code | 1
Compositional Video Understanding with Spatiotemporal Structure-based Transformers | Code | 1
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Code | 1
QuerYD: A video dataset with high-quality text and audio narrations | Code | 1
How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Code | 1
∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Code | 1
Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Code | 1
Prompting Visual-Language Models for Efficient Video Understanding | Code | 1
Clover: Towards A Unified Video-Language Alignment and Fusion Model | Code | 1
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Code | 1
Large Scale Holistic Video Understanding | Code | 1
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Code | 1
Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition | Code | 1
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Code | 1
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Code | 1
Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection | Code | 1
Relational Self-Attention: What's Missing in Attention for Video Understanding | Code | 1
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Code | 1
Page 11 of 46

No leaderboard results yet.