Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 151–200 of 1149 papers

Title	Date	Tasks	Status	Hype	Score
Occluded Video Instance Segmentation: A Benchmark	Feb 2, 2021	Instance SegmentationSegmentation	CodeCode Available	1	5
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding	Mar 27, 2025	FormLanguage Modeling	CodeCode Available	1	5
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding	May 14, 2024	Action DetectionGPU	CodeCode Available	1	5
Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention	Jun 11, 2021	Action RecognitionSign Language Recognition	CodeCode Available	1	5
NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions	May 18, 2021	Question AnsweringVideo Question Answering	CodeCode Available	1	5
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs	Apr 21, 2025	Video Understanding	CodeCode Available	1	5
A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions	Apr 21, 2022	Action DetectionVideo Understanding	CodeCode Available	1	5
Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation	Oct 31, 2024	Action SegmentationAction Understanding	CodeCode Available	1	5
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions	Jun 19, 2021	Question AnsweringVideo Question Answering	CodeCode Available	1	5
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization	Jun 14, 2020	Action DetectionAction Localization	CodeCode Available	1	5
A Multigrid Method for Efficiently Training Video Models	Dec 2, 2019	Action DetectionAction Recognition	CodeCode Available	1	5
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges	Nov 17, 2022	Future Hand PredictionMoment Queries	CodeCode Available	1	5
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment	Jan 1, 2025	audio-visual learningKnowledge Graphs	CodeCode Available	1	5
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions	Oct 13, 2021	BenchmarkingComputational Efficiency	CodeCode Available	1	5
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding	May 27, 2025	Reinforcement Learning (RL)Video Understanding	CodeCode Available	1	5
IntentVizor: Towards Generic Query Guided Interactive Video Summarization	Sep 30, 2021	Video SummarizationVideo Understanding	CodeCode Available	1	5
Is Appearance Free Action Recognition Possible?	Jul 13, 2022	Action RecognitionOptical Flow Estimation	CodeCode Available	1	5
BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection	May 5, 2022	Action Detectionobject-detection	CodeCode Available	1	5
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding	Jun 28, 2024	Multiple-choiceVideo Understanding	CodeCode Available	1	5
-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation	Jan 31, 2025	Question AnsweringVideo Question Answering	CodeCode Available	1	5
AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation	Jan 14, 2025	MambaVideo Understanding	CodeCode Available	1	5
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding	Jun 19, 2024	Question AnsweringSpatial Reasoning	CodeCode Available	1	5
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding	Jul 11, 2024	EEGLanguage Modeling	CodeCode Available	1	5
Multimodal Distillation for Egocentric Action Recognition	Jul 14, 2023	Action RecognitionKnowledge Distillation	CodeCode Available	1	5
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding	Nov 25, 2023	Video Understanding	CodeCode Available	1	5
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions	May 23, 2017	Actin DetectionAction Detection	CodeCode Available	1	5
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models	Mar 20, 2025	Multiple-choiceVideo Understanding	CodeCode Available	1	5
Multimodal Long Video Modeling Based on Temporal Dynamic Context	Apr 14, 2025	Video Understanding	CodeCode Available	1	5
AutoVideo: An Automated Video Action Recognition System	Aug 9, 2021	Action RecognitionAutoML	CodeCode Available	1	5
How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?	Mar 27, 2022	Self-Supervised LearningSensitivity	CodeCode Available	1	5
Action Scene Graphs for Long-Form Understanding of Egocentric Videos	Dec 6, 2023	Action AnticipationForm	CodeCode Available	1	5
Large Scale Holistic Video Understanding	Apr 25, 2019	Action ClassificationAction Recognition	CodeCode Available	1	5
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation	Dec 12, 2023	Anomaly DetectionAutonomous Driving	CodeCode Available	1	5
MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing	Nov 28, 2022	Activity RecognitionFew Shot Action Recognition	CodeCode Available	1	5
Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives	Feb 4, 2025	Video Understanding	CodeCode Available	1	5
Learning Video Context as Interleaved Multimodal Sequences	Jul 31, 2024	Language ModelingLanguage Modelling	CodeCode Available	1	5
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning	Mar 2, 2025	Large Language ModelMulti-Instance Retrieval	CodeCode Available	1	5
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization	Aug 12, 2024	Action LocalizationTemporal Action Localization	CodeCode Available	1	5
Language Repository for Long Video Understanding	Mar 21, 2024	EgoSchemaQuestion Answering	CodeCode Available	1	5
Agentic Keyframe Search for Video Question Answering	Mar 20, 2025	EgoSchemaQuestion Answering	CodeCode Available	1	5
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning	Dec 4, 2024	Multimodal Large Language ModelVideo Understanding	CodeCode Available	1	5
Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning	Jan 1, 2023	Contrastive LearningRepresentation Learning	CodeCode Available	1	5
MotionSqueeze: Neural Motion Feature Learning for Video Understanding	Jul 20, 2020	Action ClassificationAction Recognition	CodeCode Available	1	5
MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions	May 16, 2021	Action DetectionAction Localization	CodeCode Available	1	5
Crossover Learning for Fast Online Video Instance Segmentation	Apr 13, 2021	Instance SegmentationSemantic Segmentation	CodeCode Available	1	5
From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities	Jan 10, 2025	Human-Object Interaction DetectionKnowledge Distillation	CodeCode Available	1	5
CyberV: Cybernetics for Test-time Scaling in Video Understanding	Jun 9, 2025	Video Understanding	CodeCode Available	1	5
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model	Aug 15, 2023	DecoderObject	CodeCode Available	1	5
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering	May 30, 2022	counterfactualDescriptive	CodeCode Available	1	5
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding	Sep 27, 2024	Video UnderstandingVisual Reasoning	CodeCode Available	1	5

Show:10 25 50

← PrevPage 4 of 23Next →

No leaderboard results yet.