Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 51–100 of 1149 papers

Title	Date	Tasks	Status	Hype
VUDG: A Dataset for Video Understanding Domain Generalization	May 30, 2025	Domain Generalization, Multiple-choice	Unverified	0
Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders	May 30, 2025	Video Understanding	Unverified	0
SiLVR: A Simple Language-based Video Reasoning Framework	May 30, 2025	Math, MME	Code Available	1
Learning reusable concepts across different egocentric video understanding tasks	May 30, 2025	Video Understanding	Unverified	0
VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software	May 30, 2025	Question Answering, Spatial Reasoning	Code Available	1
Time Blindness: Why Video-Language Models Can't See What Humans Can?	May 30, 2025	Temporal Sequences, Video Understanding	Unverified	0
Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding	May 29, 2025	RAG, Retrieval-augmented Generation	Unverified	0
MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection	May 29, 2025	image-classification, Image Classification	Unverified	0
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding	May 29, 2025	Avg, Video Understanding	Code Available	0
VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models	May 29, 2025	Self-Supervised Learning, Video Generation	Code Available	2
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?	May 29, 2025	Video Understanding	Code Available	1
PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling	May 29, 2025	Video Understanding	Code Available	1
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory	May 29, 2025	Contrastive Learning, Text Retrieval	Code Available	2
Universal Visuo-Tactile Video Understanding for Embodied Interaction	May 28, 2025	Friction, Large Language Model	Unverified	0
VidText: Towards Comprehensive Evaluation for Video Text Understanding	May 28, 2025	Multimodal Reasoning, Optical Character Recognition (OCR)	Code Available	1
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding	May 27, 2025	Reinforcement Learning (RL), Video Understanding	Code Available	1
Two Causally Related Needles in a Video Haystack	May 26, 2025	Video Understanding, Visual Grounding	Unverified	0
AdaTP: Attention-Debiased Token Pruning for Video Large Language Models	May 26, 2025	Video Understanding	Unverified	0
TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos	May 26, 2025	Attribute, Video Understanding	Code Available	0
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs	May 25, 2025	Video Understanding	Unverified	0
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding	May 23, 2025	Form, Question Answering	Unverified	0
SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding	May 22, 2025	Action Classification, Automatic Speech Recognition	Code Available	0
Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles	May 22, 2025	EgoSchema, Few-Shot Learning	Unverified	0
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning	May 22, 2025	Misinformation, reinforcement-learning	Code Available	1
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design	May 22, 2025	CPU, GPU	Code Available	2
ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation	May 21, 2025	Decision Making, Language Modeling	Code Available	0
Clapper: Compact Learning and Video Representation in VLMs	May 21, 2025	Video Understanding	Unverified	0
ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning	May 21, 2025	Pseudo Label, Reinforcement Learning (RL)	Unverified	0
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval	May 21, 2025	Autonomous Driving, Question Answering	Unverified	0
Leveraging Foundation Models for Multimodal Graph-Based Action Recognition	May 21, 2025	Action Recognition, Graph Attention	Unverified	0
A Challenge to Build Neuro-Symbolic Video Agents	May 20, 2025	Scene Classification, Video Retrieval	Code Available	0
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models	May 20, 2025	Video Compression, Video Understanding	Code Available	2
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?	May 20, 2025	Video Understanding	Unverified	0
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts	May 20, 2025	Caption Generation, Retrieval	Code Available	1
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation	May 20, 2025	MME, Multiple-choice	Code Available	4
Domain Adaptation of VLM for Soccer Video Understanding	May 20, 2025	Action Classification, Domain Adaptation	Unverified	0
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding	May 19, 2025	Language Modeling, Language Modelling	Code Available	0
From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations	May 18, 2025	Video Editing, Video Understanding	Unverified	0
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models	May 13, 2025	Form, Multiple-choice	Code Available	0
SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation	May 13, 2025	Computational Efficiency, Video Understanding	Unverified	0
Gameplay Highlights Generation	May 12, 2025	Event Detection, Highlight Detection	Unverified	0
Seed1.5-VL Technical Report	May 11, 2025	Mixture-of-Experts, Multimodal Reasoning	Unverified	0
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant	May 8, 2025	Language Modeling, Language Modelling	Unverified	0
RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph	May 6, 2025	EgoSchema, Retrieval	Unverified	0
Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection	May 5, 2025	Anomaly Detection, Anomaly Detection In Surveillance Videos	Code Available	1
VideoLLM Benchmarks and Evaluation: A Survey	May 3, 2025	Survey, Video Understanding	Unverified	0
VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding	May 2, 2025	Anomaly Detection, Common Sense Reasoning	Code Available	1
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action	May 2, 2025	Dense Captioning, Highlight Detection	Code Available	1
Empowering Agentic Video Analytics Systems with Video Language Models	May 1, 2025	Knowledge Graphs, RAG	Unverified	0
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding	Apr 30, 2025	Video Understanding	Code Available	0
