Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 501–550 of 1149 papers

Title	Date	Tasks	Status
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment	Mar 26, 2025	Video Understanding	Unverified
Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations	Mar 25, 2025	Representation Learning, Video Understanding	Code Available
ACVUBench: Audio-Centric Video Understanding Benchmark	Mar 25, 2025	Video Understanding	Code Available
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks	Mar 24, 2025	Common Sense Reasoning, Prediction	Unverified
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding	Mar 24, 2025	Form, Video Understanding	Unverified
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding	Mar 24, 2025	8k, GPU	Unverified
CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos	Mar 24, 2025	Anomaly Detection, Anomaly Detection In Surveillance Videos	Unverified
Breaking the Encoder Barrier for Seamless Video-Language Understanding	Mar 24, 2025	Decoder, Language Modeling	Unverified
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding	Mar 22, 2025	Benchmarking, Object	Code Available
Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization	Mar 22, 2025	Saliency Detection, Sentence	Unverified
PVChat: Personalized Video Chat with One-Shot Learning	Mar 21, 2025	One-Shot Learning, Question Answering	Unverified
Temporal Action Detection Model Compression by Progressive Block Drop	Mar 21, 2025	Action Detection, Autonomous Driving	Unverified
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering	Mar 20, 2025	Contrastive Learning, Question Answering	Unverified
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations	Mar 20, 2025	Hallucination, Video Understanding	Unverified
What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?	Mar 20, 2025	Decoder, Graph Generation	Unverified
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding	Mar 19, 2025	Benchmarking, Multiple-choice	Unverified
Improving LLM Video Understanding with 16 Frames Per Second	Mar 18, 2025	MME, Video MME	Unverified
Impossible Videos	Mar 18, 2025	counterfactual, Video Generation	Unverified
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability	Mar 18, 2025	Language Modeling, Language Modelling	Unverified
Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition	Mar 17, 2025	Action Recognition, Video Recognition	Unverified
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory	Mar 17, 2025	Form, GPU	Unverified
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding	Mar 17, 2025	Attribute, MME	Unverified
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning	Mar 14, 2025	Benchmarking, Relational Reasoning	Unverified
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers	Mar 14, 2025	GPU, Mamba	Unverified
Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding	Mar 14, 2025	Denoising, Dense Video Captioning	Unverified
LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs	Mar 14, 2025	Video Understanding	Unverified
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing	Mar 13, 2025	EgoSchema, Form	Code Available
TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs	Mar 13, 2025	Benchmarking, Question Answering	Unverified
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents	Mar 13, 2025	Computational Efficiency, Optical Character Recognition (OCR)	Unverified
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation	Mar 12, 2025	All, counterfactual	Unverified
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization	Mar 12, 2025	Temporal Localization, Video Understanding	Unverified
Generative Frame Sampler for Long Video Understanding	Mar 12, 2025	Video Understanding	Unverified
Memory-enhanced Retrieval Augmentation for Long Video Understanding	Mar 12, 2025	RAG, Retrieval	Unverified
FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models	Mar 12, 2025	Mixture-of-Experts, Question Answering	Unverified
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers	Mar 12, 2025	GPU, Streaming video understanding	Unverified
Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding	Mar 12, 2025	Instruction Following, Video Understanding	Unverified
Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment	Mar 12, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
On the Limitations of Vision-Language Models in Understanding Image Transforms	Mar 12, 2025	Question Answering, Video Generation	Unverified
BEARCUBS: A benchmark for computer-using web agents	Mar 10, 2025	Video Understanding	Unverified
ALLVB: All-in-One Long Video Understanding Benchmark	Mar 10, 2025	All, Video Understanding	Unverified
Towards Fine-Grained Video Question Answering	Mar 10, 2025	Language Modeling, Language Modelling	Unverified
Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection	Mar 5, 2025	Anomaly Detection, Object	Unverified
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models	Feb 28, 2025	Action Understanding, Text-to-Video Generation	Unverified
PreMind: Multi-Agent Video Understanding for Advanced Indexing of Presentation-style Videos	Feb 28, 2025	Question Answering, Video Understanding	Unverified
M-LLM Based Video Frame Selection for Efficient Video Understanding	Feb 27, 2025	EgoSchema, Language Modeling	Unverified
InternVQA: Advancing Compressed Video Quality Assessment with Distilling Large Foundation Model	Feb 26, 2025	Video Quality Assessment, Video Understanding	Unverified
An Analysis of Data Transformation Effects on Segment Anything 2	Feb 25, 2025	Semantic Segmentation, Video Object Segmentation	Unverified
Fine-Grained Video Captioning through Scene Graph Consolidation	Feb 23, 2025	Caption Generation, Image Captioning	Unverified
LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models	Feb 21, 2025	Caption Generation, Video Captioning	Unverified
AVD2: Accident Video Diffusion for Accident Video Description	Feb 20, 2025	Autonomous Driving, Scene Understanding	Unverified
