Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 201–250 of 1149 papers

Title	Date	Tasks	Status	Hype
BEARCUBS: A benchmark for computer-using web agents	Mar 10, 2025	Video Understanding	Unverified	0
ALLVB: All-in-One Long Video Understanding Benchmark	Mar 10, 2025	All, Video Understanding	Unverified	0
Towards Fine-Grained Video Question Answering	Mar 10, 2025	Language Modeling, Language Modelling	Unverified	0
TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos	Mar 9, 2025	Action Localization, Boundary Detection	Code Available	1
Unified Reward Model for Multimodal Understanding and Generation	Mar 7, 2025	Image Generation, model	Code Available	4
EgoLife: Towards Egocentric Life Assistant	Mar 5, 2025	Question Answering, Video Understanding	Code Available	3
Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection	Mar 5, 2025	Anomaly Detection, Object	Unverified	0
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning	Mar 2, 2025	Large Language Model, Multi-Instance Retrieval	Code Available	1
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models	Feb 28, 2025	Action Understanding, Text-to-Video Generation	Unverified	0
PreMind: Multi-Agent Video Understanding for Advanced Indexing of Presentation-style Videos	Feb 28, 2025	Question Answering, Video Understanding	Unverified	0
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection	Feb 27, 2025	Action Detection, Benchmarking	Code Available	3
M-LLM Based Video Frame Selection for Efficient Video Understanding	Feb 27, 2025	EgoSchema, Language Modeling	Unverified	0
InternVQA: Advancing Compressed Video Quality Assessment with Distilling Large Foundation Model	Feb 26, 2025	Video Quality Assessment, Video Understanding	Unverified	0
An Analysis of Data Transformation Effects on Segment Anything 2	Feb 25, 2025	Semantic Segmentation, Video Object Segmentation	Unverified	0
Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos	Feb 25, 2025	Graph Learning, Mistake Detection	Code Available	1
Fine-Grained Video Captioning through Scene Graph Consolidation	Feb 23, 2025	Caption Generation, Image Captioning	Unverified	0
LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models	Feb 21, 2025	Caption Generation, Video Captioning	Unverified	0
AVD2: Accident Video Diffusion for Accident Video Description	Feb 20, 2025	Autonomous Driving, Scene Understanding	Unverified	0
MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval	Feb 18, 2025	Action Recognition, Moment Retrieval	Unverified	0
iMOVE: Instance-Motion-Aware Video Understanding	Feb 17, 2025	Computational Efficiency, Video Understanding	Unverified	0
VRoPE: Rotary Position Embedding for Video Large Language Models	Feb 17, 2025	Position, Video Understanding	Code Available	1
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model	Feb 17, 2025	Language Modeling, Language Modelling	Code Available	1
Semantics-aware Test-time Adaptation for 3D Human Pose Estimation	Feb 15, 2025	3D human pose and shape estimation, 3D Human Pose Estimation	Unverified	0
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding	Feb 15, 2025	Question Answering, Streaming video understanding	Code Available	2
Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering	Feb 13, 2025	Classification, Prompt Engineering	Unverified	0
Enhancing Video Understanding: Deep Neural Networks for Spatiotemporal Analysis	Feb 11, 2025	Action Recognition, Video Description	Unverified	0
A Survey on Mamba Architecture for Vision Applications	Feb 11, 2025	Mamba, object-detection	Unverified	0
CoS: Chain-of-Shot Prompting for Long Video Understanding	Feb 10, 2025	Video Understanding	Unverified	0
A Survey on Video Analytics in Cloud-Edge-Terminal Collaborative Systems	Feb 10, 2025	Autonomous Driving, Edge-computing	Unverified	0
VideoRoPE: What Makes for Good Video Rotary Position Embedding?	Feb 7, 2025	Hallucination, Position	Code Available	3
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy	Feb 7, 2025	4k, General Knowledge	Code Available	3
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs	Feb 6, 2025	Video Understanding	Unverified	0
MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding	Feb 5, 2025	Diversity, EgoSchema	Unverified	0
A Decade of Action Quality Assessment: Largest Systematic Survey of Trends, Challenges, and Future Directions	Feb 5, 2025	Action Quality Assessment, Survey	Unverified	0
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes	Feb 4, 2025	Autonomous Driving, Multiple-choice	Code Available	1
Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives	Feb 4, 2025	Video Understanding	Code Available	1
LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models	Feb 4, 2025	GPU, Video Understanding	Unverified	0
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos	Feb 3, 2025	Knowledge Graphs, RAG	Code Available	7
AIN: The Arabic INclusive Large Multimodal Model	Jan 31, 2025	document understanding, model	Code Available	2
∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation	Jan 31, 2025	Question Answering, Video Question Answering	Code Available	1
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding	Jan 28, 2025	Decoder, Video Understanding	Unverified	0
Understanding Long Videos via LLM-Powered Entity Relation Graphs	Jan 27, 2025	EgoSchema, Large Language Model	Unverified	0
TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding	Jan 26, 2025	Video Understanding	Code Available	2
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding	Jan 25, 2025	Action Understanding, Emotion Recognition	Unverified	0
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge	Jan 23, 2025	Scheduling, Streaming video understanding	Code Available	2
Temporal Preference Optimization for Long-Form Video Understanding	Jan 23, 2025	Form, MME	Unverified	0
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding	Jan 22, 2025	Philosophy, Video Question Answering	Code Available	5
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling	Jan 21, 2025	Object Tracking, Referring Expression Segmentation	Code Available	0
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model	Jan 21, 2025	Instruction Following, Mathematical Reasoning	Code Available	0
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding	Jan 21, 2025	Video Understanding	Code Available	2

