SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 701–750 of 1149 papers

Title | Status | Hype
Extending Video Masked Autoencoders to 128 frames |  | 0
Extensible Hierarchical Method of Detecting Interactive Actions for Video Understanding |  | 0
Real-Time Segmentation Networks should be Latency Aware |  | 0
Fast Retinomorphic Event Stream for Video Recognition and Reinforcement Learning |  | 0
FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models |  | 0
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding |  | 0
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models |  | 0
Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework |  | 0
Fine-Grain Annotation of Cricket Videos |  | 0
Fine-Grained Video Captioning through Scene Graph Consolidation |  | 0
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval |  | 0
First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge |  | 0
Flatten: Video Action Recognition is an Image Classification task |  | 0
Flexible Frame Selection for Efficient Video Reasoning |  | 0
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding |  | 0
FocusChat: Text-guided Long Video Understanding via Spatiotemporal Information Filtering |  | 0
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions |  | 0
Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles |  | 0
Frame-Voyager: Learning to Query Frames for Video Large Language Models |  | 0
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models |  | 0
From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction |  | 0
From Image to Video, what do we need in multimodal LLMs? |  | 0
From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations |  | 0
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment |  | 0
Fully Automated Hand Hygiene Monitoring in Operating Room using 3D Convolutional Neural Network |  | 0
Future semantic segmentation of time-lapsed videos with large temporal displacement |  | 0
Gameplay Highlights Generation |  | 0
Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention |  | 0
Generating the Future With Adversarial Transformers |  | 0
Generating Videos with Scene Dynamics |  | 0
Generative Frame Sampler for Long Video Understanding |  | 0
Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning |  | 0
GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning |  | 0
Global Motion Understanding in Large-Scale Video Object Segmentation |  | 0
Global Self-Attention Networks |  | 0
Global Self-Attention Networks for Image Recognition |  | 0
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding |  | 0
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation |  | 0
Gradient Frequency Modulation for Visually Explaining Video Understanding Models |  | 0
GraphVid: It Only Takes a Few Nodes to Understand a Video |  | 0
Grounded Objects and Interactions for Video Captioning |  | 0
Grounded Video Situation Recognition |  | 0
Grounding Action Descriptions in Videos |  | 0
Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection |  | 0
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning |  | 0
GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement |  | 0
H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding |  | 0
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models |  | 0
Harnessing Object and Scene Semantics for Large-Scale Video Understanding |  | 0
HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions |  | 0
Page 15 of 23

No leaderboard results yet.