Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 301–350 of 1149 papers

Title	Date	Tasks	Status	Hype
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing	Nov 24, 2021	audio-visual event localizationVideo Understanding	CodeCode Available	1
ETAD: Training Action Detection End to End on a Laptop	May 14, 2022	Action DetectionGPU	CodeCode Available	1
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation	Jun 15, 2025	ObjectSemantic Segmentation	CodeCode Available	1
EPIC Fields: Marrying 3D Geometry and Video Understanding	Jun 14, 2023	3D geometryNeural Rendering	CodeCode Available	1
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts	May 20, 2025	Caption GenerationRetrieval	CodeCode Available	1
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis	Apr 12, 2024	Dense Video CaptioningTransfer Learning	CodeCode Available	1
A Simple LLM Framework for Long-Range Video Question-Answering	Dec 28, 2023	EgoSchemaLanguage Modelling	CodeCode Available	1
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation	Mar 18, 2024	Referring Video Object SegmentationSemantic Segmentation	CodeCode Available	1
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning	Mar 2, 2025	Large Language ModelMulti-Instance Retrieval	CodeCode Available	1
Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning	Jan 1, 2023	Contrastive LearningRepresentation Learning	CodeCode Available	1
Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization	Aug 4, 2021	Contrastive LearningRepresentation Learning	CodeCode Available	1
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering	May 30, 2022	counterfactualDescriptive	CodeCode Available	1
Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models	Jan 1, 2025	Action RecognitionAction Segmentation	CodeCode Available	1
MM-VID: Advancing Video Understanding with GPT-4V(ision)	Oct 30, 2023	Script GenerationVideo Understanding	CodeCode Available	1
NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions	May 18, 2021	Question AnsweringVideo Question Answering	CodeCode Available	1
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness	Jan 14, 2025	Event ExtractionInstruction Following	CodeCode Available	1
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning	May 22, 2025	Misinformationreinforcement-learning	CodeCode Available	1
MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions	May 16, 2021	Action DetectionAction Localization	CodeCode Available	1
Lightweight Network Architecture for Real-Time Action Recognition	May 21, 2019	Action RecognitionCPU	CodeCode Available	1
End-to-End Video Instance Segmentation with Transformers	Nov 30, 2020	Instance SegmentationSegmentation	CodeCode Available	1
Object-Region Video Transformers	Oct 13, 2021	Action DetectionAction Recognition	CodeCode Available	1
Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection	Jun 28, 2021	Action RecognitionAction Spotting	CodeCode Available	1
Leveraging triplet loss for unsupervised action segmentation	Apr 13, 2023	Action SegmentationClustering	CodeCode Available	1
Localizing Moments in Long Video Via Multimodal Guidance	Feb 26, 2023	Natural Language Moment RetrievalNatural Language Visual Grounding	CodeCode Available	1
Learning Temporally Latent Causal Processes from General Temporal Data	Sep 29, 2021	Causal DiscoveryDisentanglement	CodeCode Available	1
End-to-end Temporal Action Detection with Transformer	Jun 18, 2021	Action DetectionTemporal Action Localization	CodeCode Available	1
Learning the Predictability of the Future	Jun 19, 2021	Representation LearningSelf-Supervised Action Recognition	CodeCode Available	1
Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads	Jun 27, 2024	Diversityimage-classification	CodeCode Available	1
End-to-End Streaming Video Temporal Action Segmentation with Reinforce Learning	Sep 27, 2023	Action RecognitionAction Segmentation	CodeCode Available	1
Open-Vocabulary Video Relation Extraction	Dec 25, 2023	Action ClassificationAction Understanding	CodeCode Available	1
End-to-End Referring Video Object Segmentation with Multimodal Transformers	Nov 29, 2021	Inductive BiasInstance Segmentation	CodeCode Available	1
Panoramic Vision Transformer for Saliency Detection in 360° Videos	Sep 19, 2022	Saliency DetectionSaliency Prediction	CodeCode Available	1
Learning Temporally Causal Latent Processes from General Temporal Data	Oct 11, 2021	Causal DiscoveryRepresentation Learning	CodeCode Available	1
PAVE: Patching and Adapting Video Large Language Models	Mar 25, 2025	Audio-visual Question AnsweringMulti-Task Learning	CodeCode Available	1
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge	Sep 30, 2022	DescriptiveRepresentation Learning	CodeCode Available	1
Learning Optical Flow with Adaptive Graph Reasoning	Feb 8, 2022	Motion EstimationOptical Flow Estimation	CodeCode Available	1
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark	Oct 24, 2024	document understandingVideo Understanding	CodeCode Available	1
Learning Salient Boundary Feature for Anchor-free Temporal Action Localization	Mar 24, 2021	Action LocalizationTemporal Action Localization	CodeCode Available	1
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment	Jan 1, 2025	audio-visual learningKnowledge Graphs	CodeCode Available	1
An overview on the evaluated video retrieval tasks at TRECVID 2022	Jun 22, 2023	Ad-hoc video searchRetrieval	CodeCode Available	1
Procedure-Aware Pretraining for Instructional Video Understanding	Mar 31, 2023	Video Understanding	CodeCode Available	1
Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection	Dec 9, 2021	Boundary DetectionDiversity	CodeCode Available	1
Language Repository for Long Video Understanding	Mar 21, 2024	EgoSchemaQuestion Answering	CodeCode Available	1
Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition	Jan 1, 2021	Action RecognitionVideo Understanding	CodeCode Available	1
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs	Apr 21, 2025	Video Understanding	CodeCode Available	1
A Comprehensive Study of Deep Video Action Recognition	Dec 11, 2020	Action RecognitionDeep Learning	CodeCode Available	1
Elaborative Rehearsal for Zero-shot Action Recognition	Aug 5, 2021	Action RecognitionFew-Shot Learning	CodeCode Available	1
Free Lunch for Surgical Video Understanding by Distilling Self-Supervisions	May 19, 2022	Contrastive LearningSelf-Supervised Learning	CodeCode Available	1
FrameExit: Conditional Early Exiting for Efficient Video Recognition	Apr 27, 2021	Video RecognitionVideo Understanding	CodeCode Available	1
Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation	Oct 31, 2024	Action SegmentationAction Understanding	CodeCode Available	1

Show:10 25 50

← PrevPage 7 of 23Next →

No leaderboard results yet.