Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 876–900 of 1149 papers

Title	Date	Tasks	Status
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning	Jun 4, 2023	Benchmarking, Contrastive Learning	Unverified
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding	Dec 8, 2023	Form, Question Answering	Unverified
MRSN: Multi-Relation Support Network for Video Action Detection	Apr 24, 2023	Action Detection, Relation	Unverified
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language	Jun 1, 2016	Image Captioning, Sentence	Unverified
Multi-kernel learning of deep convolutional features for action recognition	Jul 21, 2017	Action Recognition, Activity Recognition	Unverified
Multimodal High-order Relation Transformer for Scene Boundary Detection	Jan 1, 2023	Boundary Detection, Decoder	Unverified
Multimodal Intent Discovery from Livestream Videos	Jul 1, 2022	Intent Discovery, Video Summarization	Unverified
Multi-modal Representation Learning for Video Advertisement Content Structuring	Sep 4, 2021	Representation Learning, Re-Ranking	Unverified
Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation	Nov 30, 2023	Contrastive Learning, Domain Adaptation	Unverified
Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding	May 29, 2025	RAG, Retrieval-augmented Generation	Unverified
Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video Localization	Jan 16, 2024	Decoder, Denoising	Unverified
Multi-Scale Contrastive Learning for Video Temporal Grounding	Dec 10, 2024	Contrastive Learning, Data Augmentation	Unverified
Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding	Mar 8, 2022	Contrastive Learning, Sentence	Unverified
Multiview Transformers for Video Recognition	Jan 12, 2022	Action Classification, Action Recognition	Unverified
MVTamperBench: Evaluating Robustness of Vision-Language Models	Dec 27, 2024	Video Understanding	Unverified
Representation Learning on Visual-Symbolic Graphs for Video Understanding	May 17, 2019	Action Classification, Action Detection	Unverified
No More Shortcuts: Realizing the Potential of Temporal Self-Supervision	Dec 20, 2023	Action Classification, Attribute	Unverified
Non-local NetVLAD Encoding for Video Classification	Sep 29, 2018	Classification, General Classification	Unverified
O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning	Aug 5, 2021	Attribute, Caption Generation	Unverified
OBJECT DYNAMICS DISTILLATION FOR SCENE DECOMPOSITION AND REPRESENTATION	Sep 29, 2021	Object, Predict Future Video Frames	Unverified
Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge	Nov 15, 2021	Instance Segmentation, Object Recognition	Unverified
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding	Jul 6, 2024	Video Understanding	Unverified
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts	Mar 29, 2025	Streaming video understanding, Video Understanding	Unverified
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks	Jan 14, 2025	Language Modeling, Language Modelling	Unverified
OmniTrack: Real-time detection and tracking of objects, text and logos in video	Oct 14, 2019	GPU, object-detection	Unverified