Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Title	Date	Tasks	Status
Massively Parallel Video Networks	Jun 11, 2018	Action Recognition, Temporal Action Localization	Unverified
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model	Apr 14, 2025	Computational Efficiency, Language Modeling	Unverified
MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding	Feb 5, 2025	Diversity, EgoSchema	Unverified
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization	Mar 12, 2025	Temporal Localization, Video Understanding	Unverified
Memory Consolidation Enables Long-Context Video Understanding	Feb 8, 2024	EgoSchema, Video Understanding	Unverified
Memory-enhanced Retrieval Augmentation for Long Video Understanding	Mar 12, 2025	RAG, Retrieval	Unverified
Memory-Guided Semantic Learning Network for Temporal Sentence Grounding	Jan 3, 2022	Sentence, Temporal Sentence Grounding	Unverified
MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD	Jun 11, 2024	Video Recognition, Video Understanding	Unverified
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound	Jan 7, 2022	Action Classification, Navigate	Unverified
Mid-level Representation for Visual Recognition	Dec 23, 2015	object-detection, Object Detection	Unverified
Mimic The Raw Domain: Accelerating Action Recognition in the Compressed Domain	Nov 19, 2019	Action Recognition, Video Recognition	Unverified
M-LLM Based Video Frame Selection for Efficient Video Understanding	Feb 27, 2025	EgoSchema, Language Modeling	Unverified
MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding	Jun 10, 2025	Language Modeling, Language Modelling	Unverified
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning	Sep 30, 2024	Mixture-of-Experts, Optical Character Recognition (OCR)	Unverified
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding	Jun 20, 2024	Form, Video Understanding	Unverified
MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning	May 28, 2024	Decision Making, Video Understanding	Unverified
MM-Ego: Towards Building Egocentric Multimodal LLMs	Oct 9, 2024	Video Understanding	Unverified
Moment Quantization for Video Temporal Grounding	Apr 3, 2025	Quantization, Video Understanding	Unverified
MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval	Feb 18, 2025	Action Recognition, Moment Retrieval	Unverified
Morph: Flexible Acceleration for 3D CNN-based Video Understanding	Oct 16, 2018	MORPH, Video Recognition	Unverified
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models	Jan 6, 2025	Benchmarking, Feature Compression	Unverified
Motion-Guided Masking for Spatiotemporal Representation Learning	Aug 24, 2023	Domain Adaptation, Representation Learning	Unverified
Motion Sensitive Contrastive Learning for Self-supervised Video Representation	Aug 12, 2022	Contrastive Learning, Representation Learning	Unverified
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies	Mar 3, 2024	Text Generation, Video Understanding	Unverified
MovieNet: A Holistic Dataset for Movie Understanding	Jul 21, 2020	Video Understanding	Unverified
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning	Jun 4, 2023	Benchmarking, Contrastive Learning	Unverified
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding	Dec 8, 2023	Form, Question Answering	Unverified
MRSN: Multi-Relation Support Network for Video Action Detection	Apr 24, 2023	Action Detection, Relation	Unverified
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language	Jun 1, 2016	Image Captioning, Sentence	Unverified
Multi-kernel learning of deep convolutional features for action recognition	Jul 21, 2017	Action Recognition, Activity Recognition	Unverified
Multimodal High-order Relation Transformer for Scene Boundary Detection	Jan 1, 2023	Boundary Detection, Decoder	Unverified
Multimodal Intent Discovery from Livestream Videos	Jul 1, 2022	Intent Discovery, Video Summarization	Unverified
Multi-modal Representation Learning for Video Advertisement Content Structuring	Sep 4, 2021	Representation Learning, Re-Ranking	Unverified
Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation	Nov 30, 2023	Contrastive Learning, Domain Adaptation	Unverified
Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding	May 29, 2025	RAG, Retrieval-augmented Generation	Unverified
Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video Localization	Jan 16, 2024	Decoder, Denoising	Unverified
Multi-Scale Contrastive Learning for Video Temporal Grounding	Dec 10, 2024	Contrastive Learning, Data Augmentation	Unverified
Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding	Mar 8, 2022	Contrastive Learning, Sentence	Unverified
Multiview Transformers for Video Recognition	Jan 12, 2022	Action Classification, Action Recognition	Unverified
MVTamperBench: Evaluating Robustness of Vision-Language Models	Dec 27, 2024	Video Understanding	Unverified
Representation Learning on Visual-Symbolic Graphs for Video Understanding	May 17, 2019	Action Classification, Action Detection	Unverified
No More Shortcuts: Realizing the Potential of Temporal Self-Supervision	Dec 20, 2023	Action Classification, Attribute	Unverified
Non-local NetVLAD Encoding for Video Classification	Sep 29, 2018	Classification, General Classification	Unverified
O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning	Aug 5, 2021	Attribute, Caption Generation	Unverified
OBJECT DYNAMICS DISTILLATION FOR SCENE DECOMPOSITION AND REPRESENTATION	Sep 29, 2021	Object, Predict Future Video Frames	Unverified
Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge	Nov 15, 2021	Instance Segmentation, Object Recognition	Unverified
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding	Jul 6, 2024	Video Understanding	Unverified
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts	Mar 29, 2025	Streaming video understanding, Video Understanding	Unverified
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks	Jan 14, 2025	Language Modeling, Language Modelling	Unverified
OmniTrack: Real-time detection and tracking of objects, text and logos in video	Oct 14, 2019	GPU, object-detection	Unverified
