Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 201–250 of 1149 papers

Title	Date	Tasks	Status	Hype
Learning Video Context as Interleaved Multimodal Sequences	Jul 31, 2024	Language ModelingLanguage Modelling	CodeCode Available	1
MotionSqueeze: Neural Motion Feature Learning for Video Understanding	Jul 20, 2020	Action ClassificationAction Recognition	CodeCode Available	1
MM-VID: Advancing Video Understanding with GPT-4V(ision)	Oct 30, 2023	Script GenerationVideo Understanding	CodeCode Available	1
A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector	Jun 7, 2022	Action ClassificationAction Detection	CodeCode Available	1
MMAD: Multi-label Micro-Action Detection in Videos	Jul 7, 2024	Action AnalysisAction Detection	CodeCode Available	1
EgoTaskQA: Understanding Human Tasks in Egocentric Videos	Oct 8, 2022	Action Localizationcounterfactual	CodeCode Available	1
Agentic Keyframe Search for Video Question Answering	Mar 20, 2025	EgoSchemaQuestion Answering	CodeCode Available	1
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing	Nov 24, 2021	audio-visual event localizationVideo Understanding	CodeCode Available	1
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning	Mar 2, 2025	Large Language ModelMulti-Instance Retrieval	CodeCode Available	1
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding	May 27, 2025	Reinforcement Learning (RL)Video Understanding	CodeCode Available	1
Multimodal Long Video Modeling Based on Temporal Dynamic Context	Apr 14, 2025	Video Understanding	CodeCode Available	1
DEVIAS: Learning Disentangled Video Representations of Action and Scene	Nov 30, 2023	Action RecognitionDecoder	CodeCode Available	1
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions	Jun 19, 2021	Question AnsweringVideo Question Answering	CodeCode Available	1
Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos	Jun 3, 2024	Mistake DetectionOnline Mistake Detection	CodeCode Available	1
Panoramic Vision Transformer for Saliency Detection in 360° Videos	Sep 19, 2022	Saliency DetectionSaliency Prediction	CodeCode Available	1
SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models	Dec 15, 2023	Video Understanding	CodeCode Available	1
Crossover Learning for Fast Online Video Instance Segmentation	Apr 13, 2021	Instance SegmentationSemantic Segmentation	CodeCode Available	1
Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models	Jan 1, 2025	Action RecognitionAction Segmentation	CodeCode Available	1
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts	May 20, 2025	Caption GenerationRetrieval	CodeCode Available	1
DisTime: Distribution-based Time Representation for Video Large Language Models	May 30, 2025	Temporal LocalizationVideo Understanding	CodeCode Available	1
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation	Jun 15, 2025	ObjectSemantic Segmentation	CodeCode Available	1
MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps	Mar 23, 2025	Scene SegmentationVideo Understanding	CodeCode Available	1
Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation	Dec 16, 2021	Contrastive LearningRepresentation Learning	CodeCode Available	1
Contrastive Masked Autoencoders for Self-Supervised Video Hashing	Nov 21, 2022	DecoderRetrieval	CodeCode Available	1
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?	Mar 16, 2025	Language ModelingLanguage Modelling	CodeCode Available	1
Do Language Models Understand Time?	Dec 18, 2024	Action RecognitionAnomaly Detection	CodeCode Available	1
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos	Aug 18, 2023	point cloud video understandingSelf-Supervised Learning	CodeCode Available	1
Localizing Moments in Long Video Via Multimodal Guidance	Feb 26, 2023	Natural Language Moment RetrievalNatural Language Visual Grounding	CodeCode Available	1
Long Movie Clip Classification with State-Space Video Models	Apr 4, 2022	ClassificationDecoder	CodeCode Available	1
Lightweight Network Architecture for Real-Time Action Recognition	May 21, 2019	Action RecognitionCPU	CodeCode Available	1
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge	Sep 30, 2022	DescriptiveRepresentation Learning	CodeCode Available	1
Learning the Predictability of the Future	Jun 19, 2021	Representation LearningSelf-Supervised Action Recognition	CodeCode Available	1
Procedure-Aware Pretraining for Instructional Video Understanding	Mar 31, 2023	Video Understanding	CodeCode Available	1
Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection	Dec 9, 2021	Boundary DetectionDiversity	CodeCode Available	1
Learning Temporally Latent Causal Processes from General Temporal Data	Sep 29, 2021	Causal DiscoveryDisentanglement	CodeCode Available	1
Leveraging triplet loss for unsupervised action segmentation	Apr 13, 2023	Action SegmentationClustering	CodeCode Available	1
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding	Jul 8, 2025	Autonomous DrivingVideo Understanding	CodeCode Available	1
Event-Free Moving Object Segmentation from Moving Ego Vehicle	Apr 28, 2023	Autonomous DrivingBenchmarking	CodeCode Available	1
A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions	Apr 21, 2022	Action DetectionVideo Understanding	CodeCode Available	1
Dual-path Adaptation from Image to Video Transformers	Mar 17, 2023	Action ClassificationAction Recognition	CodeCode Available	1
Learning Optical Flow with Adaptive Graph Reasoning	Feb 8, 2022	Motion EstimationOptical Flow Estimation	CodeCode Available	1
Relational Self-Attention: What's Missing in Attention for Video Understanding	Nov 2, 2021	Action RecognitionTemporal Action Localization	CodeCode Available	1
Learning Salient Boundary Feature for Anchor-free Temporal Action Localization	Mar 24, 2021	Action LocalizationTemporal Action Localization	CodeCode Available	1
REVECA -- Rich Encoder-decoder framework for Video Event CAptioner	Jun 18, 2022	DecoderSemantic Segmentation	CodeCode Available	1
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment	Jan 1, 2025	audio-visual learningKnowledge Graphs	CodeCode Available	1
Compositional Video Understanding with Spatiotemporal Structure-based Transformers	Jan 1, 2024	Video Understanding	CodeCode Available	1
Language Repository for Long Video Understanding	Mar 21, 2024	EgoSchemaQuestion Answering	CodeCode Available	1
Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition	Jan 1, 2021	Action RecognitionVideo Understanding	CodeCode Available	1
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding	Mar 27, 2025	FormLanguage Modeling	CodeCode Available	1
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs	Apr 21, 2025	Video Understanding	CodeCode Available	1

Show:10 25 50

← PrevPage 5 of 23Next →

No leaderboard results yet.