Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers


Showing 401–450 of 1149 papers

Title	Date	Tasks	Status
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding	Jul 17, 2025	Video Grounding, Video Understanding	Unverified
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI	Jul 14, 2025	Large Language Model, Multimodal Large Language Model	Unverified
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments	Jul 14, 2025	Scene Understanding, Spatial Reasoning	Unverified
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation	Jul 8, 2025	Depth Estimation, Depth Prediction	Unverified
Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models	Jul 8, 2025	Future prediction, Large Language Model	Unverified
Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges	Jul 2, 2025	Video Understanding	Unverified
CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs	Jul 1, 2025	Text Generation, Video Understanding	Unverified
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment	Jun 28, 2025	Dynamic Time Warping, Large Language Model	Code Available
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs	Jun 27, 2025	MME, Video MME	Unverified
IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes	Jun 26, 2025	Attribute, Question Answering	Unverified
Task-Aware KV Compression For Cost-Effective Long Video Understanding	Jun 26, 2025	Video Understanding	Code Available
PEVLM: Parallel Encoding for Vision-Language Models	Jun 24, 2025	Autonomous Driving, Video Understanding	Unverified
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning	Jun 19, 2025	Multimodal Reasoning, reinforcement-learning	Unverified
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding	Jun 18, 2025	GPU, Streaming video understanding	Unverified
EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization	Jun 17, 2025	Multi-Instance Retrieval, Retrieval	Code Available
MambaMia: A State-Space-Model-Based Compression for Efficient Video Understanding in Large Multimodal Models	Jun 16, 2025	Video Understanding	Unverified
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding	Jun 16, 2025	Optical Character Recognition (OCR), RAG	Code Available
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios	Jun 11, 2025	Action Recognition, Action Segmentation	Code Available
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks	Jun 10, 2025	Multiple-choice, Open-Ended Question Answering	Unverified
MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding	Jun 10, 2025	Language Modeling, Language Modelling	Unverified
Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding	Jun 9, 2025	Contrastive Learning, Video Editing	Unverified
SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding	Jun 9, 2025	RAG, Retrieval	Unverified
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis	Jun 9, 2025	Action Classification, Benchmarking	Unverified
Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision	Jun 6, 2025	Video Understanding	Code Available
Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models	Jun 6, 2025	Segmentation, Video Understanding	Unverified
DualX-VSR: Dual Axial SpatialTemporal Transformer for Real-World Video Super-Resolution without Motion Compensation	Jun 5, 2025	Motion Compensation, Optical Flow Estimation	Unverified
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs	Jun 5, 2025	Benchmarking, Video Understanding	Unverified
APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval	Jun 5, 2025	Information Retrieval, Retrieval	Unverified
TextVidBench: A Benchmark for Long Video Scene Text Understanding	Jun 5, 2025	Prompt Engineering, Question Answering	Unverified
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding	Jun 4, 2025	MME, Video MME	Unverified
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding	Jun 3, 2025	Video Understanding	Code Available
InterRVOS: Interaction-aware Referring Video Object Segmentation	Jun 3, 2025	8k, Object	Unverified
EgoVLM: Policy Optimization for Egocentric Video Understanding	Jun 3, 2025	EgoSchema, Question Answering	Code Available
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding	Jun 2, 2025	Action Recognition, Video Understanding	Unverified
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding	Jun 1, 2025	Video Understanding	Unverified
Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis	May 31, 2025	Scene Segmentation, Segmentation	Unverified
Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders	May 30, 2025	Video Understanding	Unverified
Learning reusable concepts across different egocentric video understanding tasks	May 30, 2025	Video Understanding	Unverified
VUDG: A Dataset for Video Understanding Domain Generalization	May 30, 2025	Domain Generalization, Multiple-choice	Unverified
Time Blindness: Why Video-Language Models Can't See What Humans Can?	May 30, 2025	Temporal Sequences, Video Understanding	Unverified
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding	May 29, 2025	Avg, Video Understanding	Code Available
MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection	May 29, 2025	image-classification, Image Classification	Unverified
Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding	May 29, 2025	RAG, Retrieval-augmented Generation	Unverified
Universal Visuo-Tactile Video Understanding for Embodied Interaction	May 28, 2025	Friction, Large Language Model	Unverified
Two Causally Related Needles in a Video Haystack	May 26, 2025	Video Understanding, Visual Grounding	Unverified
TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos	May 26, 2025	Attribute, Video Understanding	Code Available
AdaTP: Attention-Debiased Token Pruning for Video Large Language Models	May 26, 2025	Video Understanding	Unverified
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs	May 25, 2025	Video Understanding	Unverified
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding	May 23, 2025	Form, Question Answering	Unverified
SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding	May 22, 2025	Action Classification, Automatic Speech Recognition	Code Available


