Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 401–450 of 1149 papers

Title	Date	Tasks	Status
Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey	Jun 5, 2022	3D Hand Pose EstimationDomain Adaptation	—Unverified
An Effective Way to Improve YouTube-8M Classification Accuracy in Google Cloud Platform	Jun 26, 2017	ClassificationDeep Learning	—Unverified
AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction	Nov 19, 2024	GPUQuestion Answering	—Unverified
EAGLE: Egocentric AGgregated Language-video Engine	Sep 26, 2024	Action RecognitionActivity Recognition	—Unverified
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding	Jun 4, 2025	MMEVideo MME	—Unverified
BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes	Apr 4, 2024	ObjectVideo Understanding	—Unverified
An Attempt towards Interpretable Audio-Visual Video Captioning	Dec 7, 2018	Audio captioningAudio-Visual Video Captioning	—Unverified
DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding	Nov 19, 2024	Question AnsweringVideo Understanding	—Unverified
Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering	Jul 1, 2022	Question AnsweringVideo Question Answering	—Unverified
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding	Nov 21, 2024	Computational EfficiencyVideo Understanding	—Unverified
Dynamic Graph Modules for Modeling Object-Object Interactions in Activity Recognition	Dec 13, 2018	3D Action RecognitionAction Recognition	—Unverified
Dynamic Appearance: A Video Representation for Action Recognition with Joint Training	Nov 23, 2022	Action RecognitionTemporal Action Localization	—Unverified
Beyond the Camera: Neural Networks in World Coordinates	Mar 12, 2020	Action RecognitionVideo Stabilization	—Unverified
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks	Oct 7, 2023	Action RecognitionMultiple-choice	—Unverified
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs	Apr 23, 2025	Token ReductionVideo Understanding	—Unverified
DualX-VSR: Dual Axial SpatialTemporal Transformer for Real-World Video Super-Resolution without Motion Compensation	Jun 5, 2025	Motion CompensationOptical Flow Estimation	—Unverified
Beyond still images: Temporal features and input variance resilience	Nov 1, 2023	Video Understanding	—Unverified
DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM	Oct 3, 2024	Object TrackingVideo Understanding	—Unverified
Abductive Ego-View Accident Video Understanding for Safe Driving Perception	Mar 1, 2024	Objectobject-detection	—Unverified
An Analysis of Data Transformation Effects on Segment Anything 2	Feb 25, 2025	Semantic SegmentationVideo Object Segmentation	—Unverified
Learning text-to-video retrieval from image captioning	Apr 26, 2024	Image CaptioningImage Retrieval	—Unverified
Dilated Temporal Relational Adversarial Network for Generic Video Summarization	Apr 30, 2018	Generative Adversarial NetworkVideo Summarization	—Unverified
DrVideo: Document Retrieval Based Long Video Understanding	Jun 18, 2024	document understandingEgoSchema	—Unverified
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model	Oct 2, 2023	Autonomous DrivingLanguage Modeling	—Unverified
A Multimodal Sentiment Dataset for Video Recommendation	Sep 17, 2021	Multimodal Sentiment AnalysisSentiment Analysis	—Unverified
A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives	Mar 5, 2024	Video Understanding	—Unverified
Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection	Dec 6, 2024	GPUMulti-Object Tracking	—Unverified
Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning	Jan 1, 2024	object-detectionObject Detection	—Unverified
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation	Jul 8, 2025	Depth EstimationDepth Prediction	—Unverified
DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action Segmentation	Jul 31, 2023	Action SegmentationHuman-Object Interaction Detection	—Unverified
BERT for Large-scale Video Segment Classification with Test-time Augmentation	Dec 2, 2019	General ClassificationVideo Understanding	—Unverified
AMEGO: Active Memory from long EGOcentric videos	Sep 17, 2024	Video Understanding	—Unverified
Domain Adaptation of VLM for Soccer Video Understanding	May 20, 2025	Action ClassificationDomain Adaptation	—Unverified
Actor-Action Semantic Segmentation with Grouping Process Models	Dec 30, 2015	Semantic SegmentationVideo Understanding	—Unverified
BEARCUBS: A benchmark for computer-using web agents	Mar 10, 2025	Video Understanding	—Unverified
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering	Mar 20, 2025	Contrastive LearningQuestion Answering	—Unverified
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling	Jan 21, 2025	Object TrackingReferring Expression Segmentation	—Unverified
DOAD: Decoupled One Stage Action Detection Network	Apr 1, 2023	Action DetectionAction Recognition	—Unverified
BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation	Aug 1, 2022	ObjectOptical Flow Estimation	—Unverified
ALLVB: All-in-One Long Video Understanding Benchmark	Mar 10, 2025	AllVideo Understanding	—Unverified
Learning reusable concepts across different egocentric video understanding tasks	May 30, 2025	Video Understanding	—Unverified
Learning Space-Time Semantic Correspondences	Jun 16, 2023	Imitation LearningSemantic correspondence	—Unverified
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation	Jul 13, 2023	Action RecognitionContrastive Learning	—Unverified
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model	Jan 21, 2025	Instruction FollowingMathematical Reasoning	—Unverified
DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition	Jan 11, 2019	Action ClassificationAction Recognition	—Unverified
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	Jul 3, 2024	ArticlesImage Comprehension	—Unverified
DLM-VMTL:A Double Layer Mapper for heterogeneous data video Multi-task prompt learning	Aug 29, 2024	Multi-Task LearningPrompt Learning	—Unverified
Integrated Object Detection and Tracking with Tracklet-Conditioned Detection	Nov 27, 2018	Objectobject-detection	—Unverified
Instrument-tissue Interaction Detection Framework for Surgical Video Understanding	Mar 30, 2024	Video Understanding	—Unverified
InstructionBench: An Instructional Video Understanding Benchmark	Apr 7, 2025	Common Sense ReasoningMultiple-choice	—Unverified

Show:10 25 50

← PrevPage 9 of 23Next →

No leaderboard results yet.