SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 41-50 of 1149 papers

Title | Status | Hype
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection | Code | 3
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | Code | 3
VideoRoPE: What Makes for Good Video Rotary Position Embedding? | Code | 3
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Code | 3
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM | Code | 3
VisionZip: Longer is Better but Not Necessary in Vision Language Models | Code | 3
Towards Universal Soccer Video Understanding | Code | 3
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Code | 3
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | Code | 3
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference | Code | 3

No leaderboard results yet.