Video Understanding

A central task in video understanding is to recognise and localise, in space and time, the actions or events that appear in a video.

Source: Action Detection from a Robot-Car Perspective

Papers

Title	Date	Tasks	Status
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment	Jun 16, 2024	Action Understanding, Benchmarking	Unverified
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks	Jun 10, 2025	Multiple-choice, Open-Ended Question Answering	Unverified
VEU-Bench: Towards Comprehensive Understanding of Video Editing	Jan 1, 2025	Video Editing, Video Understanding	Unverified
ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning	May 21, 2025	Pseudo Label, Reinforcement Learning (RL)	Unverified
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models	Nov 16, 2024	Hallucination, Video Generation	Unverified
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation	Dec 12, 2024	Phrase Grounding, Question Answering	Unverified
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models	Oct 15, 2024	Video Understanding	Unverified
Video4MRI: An Empirical Study on Brain Magnetic Resonance Image Analytics with CNN-based Video Classification Frameworks	Feb 24, 2023	Classification, Data Augmentation	Unverified
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding	Mar 18, 2024	EgoSchema, Video Understanding	Unverified
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation	Nov 20, 2024	Chatbot, Multiple-choice	Unverified
Perceptron Synthesis Network: Rethinking the Action Scale Variances in Videos	Jul 22, 2020	Action Recognition, Temporal Action Localization	Unverified
Video Domain Incremental Learning for Human Action Recognition in Home Environments	Dec 22, 2024	Action Recognition, class-incremental learning	Unverified
Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models	Jul 8, 2025	Future prediction, Large Language Model	Unverified
VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding	Apr 10, 2025	Instruction Following, Video Understanding	Unverified
VideoGLUE: Video General Understanding Evaluation of Foundation Models	Jul 6, 2023	Action Recognition, Temporal Localization	Unverified
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding	Dec 31, 2023	Spatio-Temporal Video Grounding, Video Grounding	Unverified
VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding	Jan 1, 2024	Spatio-Temporal Video Grounding, Video Grounding	Unverified
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models	Jun 24, 2024	Hallucination, Video Understanding	Unverified
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding	Jul 17, 2025	Video Grounding, Video Understanding	Unverified
Video Language Model Pretraining with Spatio-temporal Masking	Jan 1, 2025	Decoder, Language Modeling	Unverified
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges	Sep 2, 2024	GPU, MVBench	Unverified
VideoLLM Benchmarks and Evaluation: A Survey	May 3, 2025	Survey, Video Understanding	Unverified
VideoMCC: a New Benchmark for Video Comprehension	Jun 23, 2016	Multiple-choice, Video Description	Unverified
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition	May 7, 2024	Large Language Model, Multimodal Large Language Model	Unverified
VideoPrism: A Foundational Visual Encoder for Video Understanding	Feb 20, 2024	Question Answering, Video Question Answering	Unverified
Videoprompter: an ensemble of foundational models for zero-shot video understanding	Oct 23, 2023	Action Recognition, Descriptive	Unverified
Video Quality Assessment for Online Processing: From Spatial to Temporal Sampling	Jan 13, 2025	Video Quality Assessment, Video Understanding	Unverified
Video RWKV:Video Action Recognition Based RWKV	Nov 8, 2024	Action Recognition, Representation Learning	Unverified
VideoSAVi: Self-Aligned Video Language Models without Human Supervision	Dec 1, 2024	EgoSchema, MVBench	Unverified
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers	Mar 12, 2025	GPU, Streaming video understanding	Unverified
Video Swin Transformers for Egocentric Video Understanding @ Ego4D Challenges 2022	Jul 22, 2022	Object, Object State Change Classification	Unverified
Video Time: Properties, Encoders and Evaluation	Jul 18, 2018	Video Understanding	Unverified
Video Token Merging for Long-form Video Understanding	Oct 31, 2024	Form, Video Classification	Unverified
Video Understanding as Machine Translation	Jun 12, 2020	Machine Translation, Metric Learning	Unverified
Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs	Jul 2, 2024	Video Understanding	Unverified
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding	Mar 24, 2025	8k, GPU	Unverified
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding	Dec 4, 2024	Hallucination, Instruction Following	Unverified
VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery	Sep 7, 2024	Computational Efficiency, Contrastive Learning	Unverified
ViFi-ReID: A Two-Stream Vision-WiFi Multimodal Approach for Person Re-identification	Oct 13, 2024	Contrastive Learning, Person Re-Identification	Unverified
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation	Dec 1, 2024	Instruction Following, Video Understanding	Unverified
Visual Context Window Extension: A New Perspective for Long Video Understanding	Sep 30, 2024	Video Understanding	Unverified
Visual Subtitle Feature Enhanced Video Outline Generation	Aug 24, 2022	Articles, Headline Generation	Unverified
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding	May 20, 2021	Action Segmentation, Language Modeling	Unverified
VRDFormer: End-to-End Video Visual Relation Detection With Transformers	Jan 1, 2022	Object, Relation	Unverified
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning	Mar 14, 2025	Benchmarking, Relational Reasoning	Unverified
VUDG: A Dataset for Video Understanding Domain Generalization	May 30, 2025	Domain Generalization, Multiple-choice	Unverified
Wasserstein Dependency Measure for Representation Learning	Mar 28, 2019	Object Recognition, reinforcement-learning	Unverified
Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding	Mar 14, 2025	Denoising, Dense Video Captioning	Unverified
Weakly Supervised Multiclass Video Segmentation	Jun 1, 2014	Segmentation, Semantic Similarity	Unverified
Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models	Jan 1, 2025	Action Localization, Temporal Action Localization	Unverified