Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 801–850 of 1149 papers

Title	Date	Tasks	Status
Egocentric and Exocentric Methods: A Short Survey	Oct 27, 2024	Action Recognition, Survey	Unverified
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	Dec 6, 2024	document understanding, Hallucination	Unverified
Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition	Jun 27, 2018	Action Recognition, Temporal Action Localization	Unverified
Exploring Anchor-based Detection for Ego4D Natural Language Query	Aug 10, 2022	Video Understanding	Unverified
Exploring Missing Modality in Multimodal Egocentric Datasets	Jan 21, 2024	Action Recognition, Video Understanding	Unverified
Exploring State Change Capture of Heterogeneous Backbones @ Ego4D Hands and Objects Challenge 2022	Nov 16, 2022	Human-Object Interaction Detection, Object	Unverified
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding	Jan 28, 2025	Decoder, Video Understanding	Unverified
Extending Video Masked Autoencoders to 128 frames	Nov 20, 2024	Decoder, Video Understanding	Unverified
Extensible Hierarchical Method of Detecting Interactive Actions for Video Understanding	Aug 11, 2017	Action Detection, Action Recognition	Unverified
Real-Time Segmentation Networks should be Latency Aware	Apr 6, 2020	Autonomous Vehicles, Scene Segmentation	Unverified
Fast Retinomorphic Event Stream for Video Recognition and Reinforcement Learning	May 16, 2018	Action Recognition, Atari Games	Unverified
FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models	Mar 12, 2025	Mixture-of-Experts, Question Answering	Unverified
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding	Mar 19, 2025	Benchmarking, Multiple-choice	Unverified
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models	Jun 12, 2024	Video Understanding	Unverified
Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework	Nov 16, 2021	Multiple-choice, Question Answering	Unverified
Fine-Grain Annotation of Cricket Videos	Nov 24, 2015	Action Recognition, Retrieval	Unverified
Fine-Grained Video Captioning through Scene Graph Consolidation	Feb 23, 2025	Caption Generation, Image Captioning	Unverified
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval	Dec 31, 2024	Retrieval, Text Retrieval	Unverified
First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge	Sep 20, 2024	Multiple-choice, Question Answering	Unverified
Flatten: Video Action Recognition is an Image Classification task	Aug 17, 2024	Action Recognition, image-classification	Unverified
Flexible Frame Selection for Efficient Video Reasoning	Jan 1, 2025	Language Modeling, Language Modelling	Unverified
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding	Jun 1, 2025	Video Understanding	Unverified
FocusChat: Text-guided Long Video Understanding via Spatiotemporal Information Filtering	Dec 17, 2024	Language Modeling, Language Modelling	Unverified
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions	Sep 7, 2022	Image Generation, Text to Image Generation	Unverified
Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles	May 22, 2025	EgoSchema, Few-Shot Learning	Unverified
Frame-Voyager: Learning to Query Frames for Video Large Language Models	Oct 4, 2024	Question Answering, Video Question Answering	Unverified
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models	Apr 8, 2025	In-Context Learning, Instruction Following	Unverified
From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction	Apr 8, 2025	Game State Reconstruction, Jersey Number Recognition	Unverified
From Image to Video, what do we need in multimodal LLMs?	Apr 18, 2024	Video Understanding	Unverified
From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations	May 18, 2025	Video Editing, Video Understanding	Unverified
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment	Mar 26, 2025	Video Understanding	Unverified
Fully Automated Hand Hygiene Monitoring in Operating Room using 3D Convolutional Neural Network	Mar 20, 2020	Optical Flow Estimation, Transfer Learning	Unverified
Future semantic segmentation of time-lapsed videos with large temporal displacement	Dec 27, 2018	Segmentation, Semantic Segmentation	Unverified
Gameplay Highlights Generation	May 12, 2025	Event Detection, Highlight Detection	Unverified
Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention	Apr 10, 2024	Action Anticipation, Graph Neural Network	Unverified
Generating the Future With Adversarial Transformers	Jul 1, 2017	Video Understanding	Unverified
Generating Videos with Scene Dynamics	Sep 8, 2016	Action Classification, Future prediction	Unverified
Generative Frame Sampler for Long Video Understanding	Mar 12, 2025	Video Understanding	Unverified
Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning	Jun 1, 2018	Action Recognition, Representation Learning	Unverified
GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning	Dec 10, 2024	cross-modal alignment, Video Understanding	Unverified
Global Motion Understanding in Large-Scale Video Object Segmentation	May 11, 2024	Instance Segmentation, Optical Flow Estimation	Unverified
Global Self-Attention Networks	Jan 1, 2021	Video Understanding	Unverified
Global Self-Attention Networks for Image Recognition	Oct 6, 2020	Video Understanding	Unverified
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding	Jun 14, 2024	Activity Recognition, MMR total	Unverified
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation	Nov 25, 2023	Instruction Following, Language Modeling	Unverified
Gradient Frequency Modulation for Visually Explaining Video Understanding Models	Nov 1, 2021	Action Recognition, Temporal Action Localization	Unverified
GraphVid: It Only Takes a Few Nodes to Understand a Video	Jul 4, 2022	Superpixels, Video Understanding	Unverified
Grounded Objects and Interactions for Video Captioning	Nov 16, 2017	Object, Scene Understanding	Unverified
Grounded Video Situation Recognition	Oct 19, 2022	Descriptive, Structured Prediction	Unverified
Grounding Action Descriptions in Videos	Jan 1, 2013	Semantic Textual Similarity, Video Understanding	Unverified
