SOTAVerified

Video Understanding

A crucial task in video understanding is to recognise and localise, in space and time, the different actions or events appearing in a video.

Source: Action Detection from a Robot-Car Perspective
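To make the task concrete, a spatio-temporal action detection is often represented as an "action tube": a class label plus a per-frame sequence of bounding boxes, which together localise the action in both space and time. The sketch below is a minimal, hypothetical representation; the class name, fields, and labels are illustrative and not taken from any particular benchmark listed on this page.

```python
from dataclasses import dataclass, field

@dataclass
class ActionTube:
    """Hypothetical action-tube container: one action instance in a video."""
    label: str  # action class, e.g. "crossing-road" (illustrative)
    # frame index -> (x1, y1, x2, y2) bounding box in pixel coordinates
    boxes: dict = field(default_factory=dict)

    @property
    def start_frame(self) -> int:
        # temporal localisation: first frame the action is visible
        return min(self.boxes)

    @property
    def end_frame(self) -> int:
        # temporal localisation: last frame the action is visible
        return max(self.boxes)

tube = ActionTube(label="crossing-road")
tube.boxes[10] = (120, 80, 180, 240)
tube.boxes[11] = (122, 81, 183, 241)
print(tube.label, tube.start_frame, tube.end_frame)  # crossing-road 10 11
```

A detector for this task would output a set of such tubes per video, each scored per class; evaluation then matches predicted tubes to ground truth by spatio-temporal overlap.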

Papers

Showing 351–400 of 1149 papers

Title | Status | Hype
Elaborative Rehearsal for Zero-shot Action Recognition | Code | 1
Free Lunch for Surgical Video Understanding by Distilling Self-Supervisions | Code | 1
MMAD: Multi-label Micro-Action Detection in Videos | Code | 1
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | Code | 1
Spatial-Temporal Transformer for Dynamic Scene Graph Generation | Code | 1
Spatio-temporal Prompting Network for Robust Video Feature Extraction | Code | 1
SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos | Code | 1
From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities | Code | 1
Learning the Predictability of the Future | Code | 1
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Code | 1
EgoTaskQA: Understanding Human Tasks in Egocentric Videos | Code | 1
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding | Code | 1
EgoSurgery-Phase: A Dataset of Surgical Phase Recognition from Egocentric Open Surgery Videos | Code | 1
Leveraging triplet loss for unsupervised action segmentation | Code | 1
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1
T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs | Code | 1
Learning Temporally Causal Latent Processes from General Temporal Data | Code | 1
Can An Image Classifier Suffice For Action Recognition? | Code | 1
Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation | Code | 1
Teaching VLMs to Localize Specific Objects from In-context Examples | Code | 1
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | Code | 1
Temporal Aggregate Representations for Long-Range Video Understanding | Code | 1
Learning Temporally Latent Causal Processes from General Temporal Data | Code | 1
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Code | 1
Lightweight Network Architecture for Real-Time Action Recognition | Code | 1
TSM: Temporal Shift Module for Efficient Video Understanding | Code | 1
Language Repository for Long Video Understanding | Code | 1
Test of Time: Instilling Video-Language Models with a Sense of Time | Code | 1
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval | Code | 1
Learning Optical Flow with Adaptive Graph Reasoning | Code | 1
Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation | Code | 1
An Empirical Study of End-to-End Temporal Action Detection | Code | 1
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Code | 1
Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Code | 1
-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Code | 1
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Code | 1
EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens | Code | 1
Towards High-Quality Temporal Action Detection with Sparse Proposals | Code | 1
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Code | 1
Towards Smooth Video Composition | Code | 1
Is Appearance Free Action Recognition Possible? | Code | 1
Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach | Code | 1
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Code | 1
Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention | Code | 1
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Code | 1
CyberV: Cybernetics for Test-time Scaling in Video Understanding | Code | 1
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Code | 1
Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition | Code | 1
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Code | 1
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models | Code | 1
Page 8 of 23

No leaderboard results yet.