SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 451-500 of 1149 papers

Title | Status | Hype
Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles | - | 0
ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | Code | 0
Leveraging Foundation Models for Multimodal Graph-Based Action Recognition | - | 0
ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning | - | 0
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | - | 0
Clapper: Compact Learning and Video Representation in VLMs | - | 0
Domain Adaptation of VLM for Soccer Video Understanding | - | 0
A Challenge to Build Neuro-Symbolic Video Agents | Code | 0
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? | - | 0
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding | Code | 0
From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations | - | 0
SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation | - | 0
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | Code | 0
Gameplay Highlights Generation | - | 0
Seed1.5-VL Technical Report | - | 0
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant | - | 0
RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph | - | 0
VideoLLM Benchmarks and Evaluation: A Survey | - | 0
Empowering Agentic Video Analytics Systems with Video Language Models | - | 0
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding | Code | 0
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | - | 0
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs | - | 0
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | - | 0
Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection | - | 0
Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Code | 0
ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task | - | 0
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding | - | 0
How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos? | - | 0
Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval | - | 0
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | - | 0
PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild | - | 0
OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding | - | 0
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model | - | 0
Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking | - | 0
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding | - | 0
How Can Objects Help Video-Language Understanding? | - | 0
VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding | - | 0
From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction | - | 0
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | - | 0
InstructionBench: An Instructional Video Understanding Benchmark | - | 0
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | - | 0
Moment Quantization for Video Temporal Grounding | - | 0
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding | - | 0
Is Temporal Prompting All We Need For Limited Labeled Action Recognition? | - | 0
Aligned Better, Listen Better for Audio-Visual Large Language Models | - | 0
DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description | - | 0
H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding | - | 0
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | - | 0
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | - | 0
Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding | - | 0

No leaderboard results yet.