Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers


Showing 101–150 of 1149 papers

Title	Date	Tasks	Status	Hype
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering	Apr 25, 2025	Caption Generation, EgoSchema	Code Available	1
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos	Apr 24, 2025	MME, Video MME	Code Available	3
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation	Apr 24, 2025	Caption Generation, Dense Video Captioning	Unverified	0
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs	Apr 23, 2025	Token Reduction, Video Understanding	Unverified	0
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs	Apr 21, 2025	Video Understanding	Code Available	1
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models	Apr 21, 2025	MME, Video MME	Code Available	4
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes	Apr 21, 2025	MME, Video MME	Unverified	0
Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection	Apr 20, 2025	Action Detection, Decoder	Unverified	0
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding	Apr 20, 2025	Language Modeling, Language Modelling	Unverified	0
ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task	Apr 20, 2025	Language Modeling, Language Modelling	Unverified	0
Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding	Apr 20, 2025	Autonomous Driving, Image Captioning	Code Available	0
How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?	Apr 19, 2025	Video Understanding	Unverified	0
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models	Apr 17, 2025	Hallucination, Video Understanding	Code Available	1
Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval	Apr 17, 2025	Partially Relevant Video Retrieval, Retrieval	Unverified	0
Perception Encoder: The best visual embeddings are not at the output of the network	Apr 17, 2025	Depth Estimation, Language Modeling	Code Available	8
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	Apr 17, 2025	Video Question Answering, Video Understanding	Code Available	7
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization	Apr 16, 2025	Hallucination, Question Answering	Unverified	0
OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding	Apr 15, 2025	Semantic Segmentation, Video Generation	Unverified	0
PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild	Apr 15, 2025	Segmentation, Semantic Segmentation	Unverified	0
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model	Apr 14, 2025	Computational Efficiency, Language Modeling	Unverified	0
Multimodal Long Video Modeling Based on Temporal Dynamic Context	Apr 14, 2025	Video Understanding	Code Available	1
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning	Apr 13, 2025	Question Answering, reinforcement-learning	Code Available	2
F^3Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos	Apr 11, 2025	Action Understanding, Event Detection	Code Available	1
Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking	Apr 11, 2025	Moment Retrieval, Question Answering	Unverified	0
How Can Objects Help Video-Language Understanding?	Apr 10, 2025	Image Captioning, Object	Unverified	0
VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding	Apr 10, 2025	Instruction Following, Video Understanding	Unverified	0
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding	Apr 10, 2025	Video Understanding	Unverified	0
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning	Apr 9, 2025	MVBench, Object Tracking	Code Available	3
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models	Apr 8, 2025	In-Context Learning, Instruction Following	Unverified	0
From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction	Apr 8, 2025	Game State Reconstruction, Jersey Number Recognition	Unverified	0
InstructionBench: An Instructional Video Understanding Benchmark	Apr 7, 2025	Common Sense Reasoning, Multiple-choice	Unverified	0
Re-thinking Temporal Search for Long-Form Video Understanding	Apr 3, 2025	Computational Efficiency, Form	Code Available	2
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval	Apr 3, 2025	Information Retrieval, Representation Learning	Unverified	0
Moment Quantization for Video Temporal Grounding	Apr 3, 2025	Quantization, Video Understanding	Unverified	0
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation	Apr 3, 2025	Computational Efficiency, GPU	Code Available	2
Aligned Better, Listen Better for Audio-Visual Large Language Models	Apr 2, 2025	Video Understanding	Unverified	0
Is Temporal Prompting All We Need For Limited Labeled Action Recognition?	Apr 2, 2025	Action Recognition, All	Unverified	0
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding	Apr 2, 2025	Video Understanding	Unverified	0
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning	Apr 2, 2025	MME, Spatial Reasoning	Code Available	2
Slow-Fast Architecture for Video Multi-Modal Large Language Models	Apr 2, 2025	Video Understanding	Code Available	1
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1	Mar 31, 2025	Logical Reasoning, Multiple-choice	Code Available	2
H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding	Mar 31, 2025	Video Understanding	Unverified	0
DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description	Mar 31, 2025	Video Description, Video Understanding	Unverified	0
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition	Mar 30, 2025	Action Classification, Action Recognition	Unverified	0
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts	Mar 29, 2025	Streaming video understanding, Video Understanding	Unverified	0
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding	Mar 27, 2025	Form, Language Modeling	Code Available	1
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model	Mar 27, 2025	EgoSchema, Language Modeling	Code Available	2
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment	Mar 26, 2025	Video Understanding	Unverified	0
Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding	Mar 26, 2025	GPU, Question Answering	Unverified	0
Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations	Mar 25, 2025	Representation Learning, Video Understanding	Code Available	0


