Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 1–50 of 1149 papers

Title	Date	Tasks	Status	Hype
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding	Jul 17, 2025	Video Grounding, Video Understanding	Unverified	0
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks	Jul 15, 2025	Video Captioning, Video Understanding	Code Available	1
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI	Jul 14, 2025	Large Language Model, Multimodal Large Language Model	Unverified	0
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments	Jul 14, 2025	Scene Understanding, Spatial Reasoning	Unverified	0
Omni-Video: Democratizing Unified Video Understanding and Generation	Jul 8, 2025	Video Generation, Video Understanding	Code Available	2
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding	Jul 8, 2025	Autonomous Driving, Video Understanding	Code Available	1
Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models	Jul 8, 2025	Future prediction, Large Language Model	Unverified	0
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation	Jul 8, 2025	Depth Estimation, Depth Prediction	Unverified	0
Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges	Jul 2, 2025	Video Understanding	Unverified	0
Kwai Keye-VL Technical Report	Jul 2, 2025	Instruction Following, Reinforcement Learning (RL)	Code Available	4
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning	Jul 1, 2025	document understanding, Multimodal Reasoning	Code Available	7
CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs	Jul 1, 2025	Text Generation, Video Understanding	Unverified	0
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams	Jun 30, 2025	cross-modal alignment, EgoSchema	Code Available	3
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment	Jun 28, 2025	Dynamic Time Warping, Large Language Model	Code Available	0
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs	Jun 27, 2025	MME, Video MME	Unverified	0
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs	Jun 27, 2025	Question Answering, Video Question Answering	Code Available	2
IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes	Jun 26, 2025	Attribute, Question Answering	Unverified	0
Task-Aware KV Compression For Cost-Effective Long Video Understanding	Jun 26, 2025	Video Understanding	Code Available	0
PEVLM: Parallel Encoding for Vision-Language Models	Jun 24, 2025	Autonomous Driving, Video Understanding	Unverified	0
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning	Jun 19, 2025	Multimodal Reasoning, reinforcement-learning	Unverified	0
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models	Jun 18, 2025	Audio captioning, Large Language Model	Code Available	2
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding	Jun 18, 2025	GPU, Streaming video understanding	Unverified	0
EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization	Jun 17, 2025	Multi-Instance Retrieval, Retrieval	Code Available	0
MambaMia: A State-Space-Model-Based Compression for Efficient Video Understanding in Large Multimodal Models	Jun 16, 2025	Video Understanding	Unverified	0
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding	Jun 16, 2025	Optical Character Recognition (OCR), RAG	Code Available	0
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation	Jun 15, 2025	Object, Semantic Segmentation	Code Available	1
Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation	Jun 13, 2025	Anomaly Detection, Clustering	Code Available	1
VideoDeepResearch: Long Video Understanding With Agentic Tool Using	Jun 12, 2025	MME, Video MME	Code Available	2
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios	Jun 11, 2025	Action Recognition, Action Segmentation	Code Available	0
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks	Jun 10, 2025	Multiple-choice, Open-Ended Question Answering	Unverified	0
MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding	Jun 10, 2025	Language Modeling, Language Modelling	Unverified	0
SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding	Jun 9, 2025	RAG, Retrieval	Unverified	0
CyberV: Cybernetics for Test-time Scaling in Video Understanding	Jun 9, 2025	Video Understanding	Code Available	1
Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding	Jun 9, 2025	Contrastive Learning, Video Editing	Unverified	0
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis	Jun 9, 2025	Action Classification, Benchmarking	Unverified	0
Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models	Jun 6, 2025	Segmentation, Video Understanding	Unverified	0
Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision	Jun 6, 2025	Video Understanding	Code Available	0
TextVidBench: A Benchmark for Long Video Scene Text Understanding	Jun 5, 2025	Prompt Engineering, Question Answering	Unverified	0
APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval	Jun 5, 2025	Information Retrieval, Retrieval	Unverified	0
DualX-VSR: Dual Axial SpatialTemporal Transformer for Real-World Video Super-Resolution without Motion Compensation	Jun 5, 2025	Motion Compensation, Optical Flow Estimation	Unverified	0
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs	Jun 5, 2025	Benchmarking, Video Understanding	Unverified	0
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding	Jun 4, 2025	MME, Video MME	Unverified	0
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding	Jun 3, 2025	Video Understanding	Code Available	0
EgoVLM: Policy Optimization for Egocentric Video Understanding	Jun 3, 2025	EgoSchema, Question Answering	Code Available	0
InterRVOS: Interaction-aware Referring Video Object Segmentation	Jun 3, 2025	8k, Object	Unverified	0
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding	Jun 2, 2025	Action Recognition, Video Understanding	Unverified	0
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency	Jun 2, 2025	reinforcement-learning, Reinforcement Learning	Code Available	2
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding	Jun 1, 2025	Video Understanding	Unverified	0
Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis	May 31, 2025	Scene Segmentation, Segmentation	Unverified	0
SiLVR: A Simple Language-based Video Reasoning Framework	May 30, 2025	Math, MME	Code Available	1

