Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 151–200 of 1149 papers

Title	Date	Tasks	Status	Hype
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation	Mar 25, 2025	Hallucination, Hallucination Evaluation	Code Available	1
PAVE: Patching and Adapting Video Large Language Models	Mar 25, 2025	Audio-visual Question Answering, Multi-Task Learning	Code Available	1
ACVUBench: Audio-Centric Video Understanding Benchmark	Mar 25, 2025	Video Understanding	Code Available	0
CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos	Mar 24, 2025	Anomaly Detection, Anomaly Detection In Surveillance Videos	Unverified	0
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding	Mar 24, 2025	Form, Video Understanding	Unverified	0
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding	Mar 24, 2025	8k, GPU	Unverified	0
Breaking the Encoder Barrier for Seamless Video-Language Understanding	Mar 24, 2025	Decoder, Language Modeling	Unverified	0
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks	Mar 24, 2025	Common Sense Reasoning, Prediction	Unverified	0
MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps	Mar 23, 2025	Scene Segmentation, Video Understanding	Code Available	1
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction	Mar 22, 2025	Benchmarking, Video Understanding	Code Available	1
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding	Mar 22, 2025	Benchmarking, Object	Code Available	0
Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization	Mar 22, 2025	Saliency Detection, Sentence	Unverified	0
Temporal Action Detection Model Compression by Progressive Block Drop	Mar 21, 2025	Action Detection, Autonomous Driving	Unverified	0
PVChat: Personalized Video Chat with One-Shot Learning	Mar 21, 2025	One-Shot Learning, Question Answering	Unverified	0
What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?	Mar 20, 2025	Decoder, Graph Generation	Unverified	0
Agentic Keyframe Search for Video Question Answering	Mar 20, 2025	EgoSchema, Question Answering	Code Available	1
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models	Mar 20, 2025	Multiple-choice, Video Understanding	Code Available	1
XAttention: Block Sparse Attention with Antidiagonal Scoring	Mar 20, 2025	Video Generation, Video Understanding	Code Available	3
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering	Mar 20, 2025	Contrastive Learning, Question Answering	Unverified	0
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations	Mar 20, 2025	Hallucination, Video Understanding	Unverified	0
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding	Mar 20, 2025	Video Understanding, Zero-shot Generalization	Code Available	1
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding	Mar 19, 2025	Benchmarking, Multiple-choice	Unverified	0
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability	Mar 18, 2025	Language Modeling, Language Modelling	Unverified	0
Improving LLM Video Understanding with 16 Frames Per Second	Mar 18, 2025	MME, Video MME	Unverified	0
Impossible Videos	Mar 18, 2025	counterfactual, Video Generation	Unverified	0
ViSpeak: Visual Instruction Feedback in Streaming Videos	Mar 17, 2025	Streaming video understanding, Video Understanding	Code Available	2
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning	Mar 17, 2025	Grounded Video Question Answering, Question Answering	Code Available	3
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory	Mar 17, 2025	Form, GPU	Unverified	0
Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition	Mar 17, 2025	Action Recognition, Video Recognition	Unverified	0
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding	Mar 17, 2025	Attribute, MME	Unverified	0
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?	Mar 16, 2025	Language Modeling, Language Modelling	Code Available	1
AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding	Mar 16, 2025	Video Understanding	Code Available	2
Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding	Mar 14, 2025	Denoising, Dense Video Captioning	Unverified	0
LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs	Mar 14, 2025	Video Understanding	Unverified	0
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers	Mar 14, 2025	GPU, Mamba	Unverified	0
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning	Mar 14, 2025	Benchmarking, Relational Reasoning	Unverified	0
TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs	Mar 13, 2025	Benchmarking, Question Answering	Unverified	0
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing	Mar 13, 2025	EgoSchema, Form	Code Available	0
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents	Mar 13, 2025	Computational Efficiency, Optical Character Recognition (OCR)	Unverified	0
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation	Mar 12, 2025	All, counterfactual	Unverified	0
On the Limitations of Vision-Language Models in Understanding Image Transforms	Mar 12, 2025	Question Answering, Video Generation	Unverified	0
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization	Mar 12, 2025	Temporal Localization, Video Understanding	Unverified	0
FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models	Mar 12, 2025	Mixture-of-Experts, Question Answering	Unverified	0
Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment	Mar 12, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers	Mar 12, 2025	GPU, Streaming video understanding	Unverified	0
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary	Mar 12, 2025	EgoSchema, Retrieval	Code Available	4
Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding	Mar 12, 2025	Instruction Following, Video Understanding	Unverified	0
Generative Frame Sampler for Long Video Understanding	Mar 12, 2025	Video Understanding	Unverified	0
Memory-enhanced Retrieval Augmentation for Long Video Understanding	Mar 12, 2025	RAG, Retrieval	Unverified	0
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension	Mar 11, 2025	AutoML, Decoder	Code Available	2

