SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 426–450 of 1149 papers

Title | Status | Hype
DualX-VSR: Dual Axial Spatial×Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation | - | 0
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | - | 0
APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval | - | 0
TextVidBench: A Benchmark for Long Video Scene Text Understanding | - | 0
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding | - | 0
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding | Code | 0
InterRVOS: Interaction-aware Referring Video Object Segmentation | - | 0
EgoVLM: Policy Optimization for Egocentric Video Understanding | Code | 0
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | - | 0
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding | - | 0
Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis | - | 0
Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders | - | 0
Learning reusable concepts across different egocentric video understanding tasks | - | 0
VUDG: A Dataset for Video Understanding Domain Generalization | - | 0
Time Blindness: Why Video-Language Models Can't See What Humans Can? | - | 0
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding | Code | 0
MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection | - | 0
Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding | - | 0
Universal Visuo-Tactile Video Understanding for Embodied Interaction | - | 0
Two Causally Related Needles in a Video Haystack | - | 0
TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos | Code | 0
AdaTP: Attention-Debiased Token Pruning for Video Large Language Models | - | 0
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs | - | 0
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding | - | 0
SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding | Code | 0

No leaderboard results yet.