Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers


Showing 551–600 of 1149 papers

Title	Date	Tasks	Status
MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval	Feb 18, 2025	Action Recognition, Moment Retrieval	Unverified
iMOVE: Instance-Motion-Aware Video Understanding	Feb 17, 2025	Computational Efficiency, Video Understanding	Unverified
Semantics-aware Test-time Adaptation for 3D Human Pose Estimation	Feb 15, 2025	3D human pose and shape estimation, 3D Human Pose Estimation	Unverified
Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering	Feb 13, 2025	Classification, Prompt Engineering	Unverified
Enhancing Video Understanding: Deep Neural Networks for Spatiotemporal Analysis	Feb 11, 2025	Action Recognition, Video Description	Unverified
A Survey on Mamba Architecture for Vision Applications	Feb 11, 2025	Mamba, object-detection	Unverified
A Survey on Video Analytics in Cloud-Edge-Terminal Collaborative Systems	Feb 10, 2025	Autonomous Driving, Edge-computing	Unverified
CoS: Chain-of-Shot Prompting for Long Video Understanding	Feb 10, 2025	Video Understanding	Unverified
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs	Feb 6, 2025	Video Understanding	Unverified
MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding	Feb 5, 2025	Diversity, EgoSchema	Unverified
A Decade of Action Quality Assessment: Largest Systematic Survey of Trends, Challenges, and Future Directions	Feb 5, 2025	Action Quality Assessment, Survey	Unverified
LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models	Feb 4, 2025	GPU, Video Understanding	Unverified
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding	Jan 28, 2025	Decoder, Video Understanding	Unverified
Understanding Long Videos via LLM-Powered Entity Relation Graphs	Jan 27, 2025	EgoSchema, Large Language Model	Unverified
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding	Jan 25, 2025	Action Understanding, Emotion Recognition	Unverified
Temporal Preference Optimization for Long-Form Video Understanding	Jan 23, 2025	Form, MME	Unverified
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model	Jan 21, 2025	Instruction Following, Mathematical Reasoning	Unverified
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling	Jan 21, 2025	Object Tracking, Referring Expression Segmentation	Unverified
HFGCN: Hypergraph Fusion Graph Convolutional Networks for Skeleton-Based Action Recognition	Jan 19, 2025	Action Recognition, Relation Classification	Unverified
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks	Jan 14, 2025	Language Modeling, Language Modelling	Unverified
Video Quality Assessment for Online Processing: From Spatial to Temporal Sampling	Jan 13, 2025	Video Quality Assessment, Video Understanding	Unverified
X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding	Jan 12, 2025	Video Understanding	Unverified
Zero-shot Shark Tracking and Biometrics from Aerial Imagery	Jan 10, 2025	Video Understanding	Unverified
LongViTU: Instruction Tuning for Long-Form Video Understanding	Jan 9, 2025	EgoSchema, Form	Unverified
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding	Jan 9, 2025	Language Modeling, Language Modelling	Unverified
H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving	Jan 8, 2025	Autonomous Driving, Mamba	Unverified
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs	Jan 8, 2025	EgoSchema, Object Tracking	Unverified
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models	Jan 6, 2025	Benchmarking, Feature Compression	Unverified
HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding	Jan 3, 2025	Question Answering, Video Understanding	Code Available
HuMoCon: Concept Discovery for Human Motion Understanding	Jan 1, 2025	Video Understanding	Unverified
Efficient Motion-Aware Video MLLM	Jan 1, 2025	Question Answering, Video Question Answering	Unverified
VEU-Bench: Towards Comprehensive Understanding of Video Editing	Jan 1, 2025	Video Editing, Video Understanding	Unverified
Video Language Model Pretraining with Spatio-temporal Masking	Jan 1, 2025	Decoder, Language Modeling	Unverified
Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs	Jan 1, 2025	Multiple-choice, Video Generation	Unverified
AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction	Jan 1, 2025	GPU, Question Answering	Unverified
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding	Jan 1, 2025	Question Answering, Video Understanding	Unverified
Adapting Pre-trained 3D Models for Point Cloud Video Understanding via Cross-frame Spatio-temporal Perception	Jan 1, 2025	Autonomous Driving, Gesture Recognition	Unverified
Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models	Jan 1, 2025	Action Localization, Temporal Action Localization	Unverified
Flexible Frame Selection for Efficient Video Reasoning	Jan 1, 2025	Language Modeling, Language Modelling	Unverified
OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models	Dec 31, 2024	Activity Recognition, Human Interaction Recognition	Unverified
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding	Dec 31, 2024	Robot Manipulation, Scene Understanding	Unverified
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval	Dec 31, 2024	Retrieval, Text Retrieval	Unverified
Detection-Fusion for Knowledge Graph Extraction from Videos	Dec 30, 2024	Knowledge Graphs, Language Modeling	Code Available
MVTamperBench: Evaluating Robustness of Vision-Language Models	Dec 27, 2024	Video Understanding	Unverified
Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries	Dec 26, 2024	Question Answering, Video Question Answering	Unverified
HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data	Dec 23, 2024	Action Recognition, Video Understanding	Unverified
Video Domain Incremental Learning for Human Action Recognition in Home Environments	Dec 22, 2024	Action Recognition, class-incremental learning	Unverified
FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos	Dec 22, 2024	Language Modelling, Large Language Model	Code Available
ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries	Dec 17, 2024	Human Detection, image-classification	Unverified
FocusChat: Text-guided Long Video Understanding via Spatiotemporal Information Filtering	Dec 17, 2024	Language Modeling, Language Modelling	Unverified

