SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 251–275 of 1149 papers

Title | Status | Hype
Compositional Video Understanding with Spatiotemporal Structure-based Transformers | Code | 1
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation | Code | 1
Streaming Video Model | Code | 1
Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties | Code | 1
Long Movie Clip Classification with State-Space Video Models | Code | 1
Lightweight Network Architecture for Real-Time Action Recognition | Code | 1
Leveraging triplet loss for unsupervised action segmentation | Code | 1
Clover: Towards A Unified Video-Language Alignment and Fusion Model | Code | 1
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Code | 1
Learning Temporally Latent Causal Processes from General Temporal Data | Code | 1
Learning Temporally Causal Latent Processes from General Temporal Data | Code | 1
Learning the Predictability of the Future | Code | 1
Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization | Code | 1
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Code | 1
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | Code | 1
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | Code | 1
Localizing Moments in Long Video Via Multimodal Guidance | Code | 1
Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models | Code | 1
MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions | Code | 1
Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation | Code | 1
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Code | 1
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Code | 1
Language Repository for Long Video Understanding | Code | 1
Is Appearance Free Action Recognition Possible? | Code | 1
A Simple LLM Framework for Long-Range Video Question-Answering | Code | 1
Page 11 of 46

No leaderboard results yet.