SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 376-400 of 1149 papers

Title | Status | Hype
TSM: Temporal Shift Module for Efficient Video Understanding | Code | 1
Language Repository for Long Video Understanding | Code | 1
Test of Time: Instilling Video-Language Models with a Sense of Time | Code | 1
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval | Code | 1
Learning Optical Flow with Adaptive Graph Reasoning | Code | 1
Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation | Code | 1
An Empirical Study of End-to-End Temporal Action Detection | Code | 1
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Code | 1
Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Code | 1
∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Code | 1
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Code | 1
EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens | Code | 1
Towards High-Quality Temporal Action Detection with Sparse Proposals | Code | 1
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Code | 1
Towards Smooth Video Composition | Code | 1
Is Appearance Free Action Recognition Possible? | Code | 1
Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach | Code | 1
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention | Code | 1
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Code | 1
CyberV: Cybernetics for Test-time Scaling in Video Understanding | Code | 1
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Code | 1
Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition | Code | 1
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Code | 1
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models | Code | 1

No leaderboard results yet.