SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 701-725 of 1149 papers

Title | Status | Hype
Extending Video Masked Autoencoders to 128 frames | | 0
Extensible Hierarchical Method of Detecting Interactive Actions for Video Understanding | | 0
Real-Time Segmentation Networks should be Latency Aware | | 0
Fast Retinomorphic Event Stream for Video Recognition and Reinforcement Learning | | 0
FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models | | 0
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | | 0
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models | | 0
Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | | 0
Fine-Grain Annotation of Cricket Videos | | 0
Fine-Grained Video Captioning through Scene Graph Consolidation | | 0
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | | 0
First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | | 0
Flatten: Video Action Recognition is an Image Classification task | | 0
Flexible Frame Selection for Efficient Video Reasoning | | 0
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding | | 0
FocusChat: Text-guided Long Video Understanding via Spatiotemporal Information Filtering | | 0
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions | | 0
Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles | | 0
Frame-Voyager: Learning to Query Frames for Video Large Language Models | | 0
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | | 0
From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction | | 0
From Image to Video, what do we need in multimodal LLMs? | | 0
From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations | | 0
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment | | 0
Fully Automated Hand Hygiene Monitoring in Operating Room using 3D Convolutional Neural Network | | 0
Page 29 of 46

No leaderboard results yet.