SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 801–825 of 1149 papers

Title | Status | Hype
Egocentric and Exocentric Methods: A Short Survey |  | 0
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling |  | 0
Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition |  | 0
Exploring Anchor-based Detection for Ego4D Natural Language Query |  | 0
Exploring Missing Modality in Multimodal Egocentric Datasets |  | 0
Exploring State Change Capture of Heterogeneous Backbones @ Ego4D Hands and Objects Challenge 2022 |  | 0
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding |  | 0
Extending Video Masked Autoencoders to 128 frames |  | 0
Extensible Hierarchical Method of Detecting Interactive Actions for Video Understanding |  | 0
Real-Time Segmentation Networks should be Latency Aware |  | 0
Fast Retinomorphic Event Stream for Video Recognition and Reinforcement Learning |  | 0
FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models |  | 0
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding |  | 0
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models |  | 0
Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework |  | 0
Fine-Grain Annotation of Cricket Videos |  | 0
Fine-Grained Video Captioning through Scene Graph Consolidation |  | 0
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval |  | 0
First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge |  | 0
Flatten: Video Action Recognition is an Image Classification task |  | 0
Flexible Frame Selection for Efficient Video Reasoning |  | 0
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding |  | 0
FocusChat: Text-guided Long Video Understanding via Spatiotemporal Information Filtering |  | 0
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions |  | 0
Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles |  | 0

No leaderboard results yet.