Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 451–500 of 1149 papers

Title	Date	Tasks	Status
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering	Mar 20, 2025	Contrastive Learning, Question Answering	Unverified
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling	Jan 21, 2025	Object Tracking, Referring Expression Segmentation	Unverified
DOAD: Decoupled One Stage Action Detection Network	Apr 1, 2023	Action Detection, Action Recognition	Unverified
In-the-Wild Video Question Answering	Oct 1, 2022	Evidence Selection, Question Answering	Unverified
BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation	Aug 1, 2022	Object, Optical Flow Estimation	Unverified
IPAD: Industrial Process Anomaly Detection Dataset	Apr 23, 2024	Anomaly Detection, Video Anomaly Detection	Unverified
ALLVB: All-in-One Long Video Understanding Benchmark	Mar 10, 2025	All, Video Understanding	Unverified
IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs	Dec 13, 2024	Question Answering, Video Question Answering	Unverified
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory	Mar 17, 2025	Form, GPU	Unverified
LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models	Feb 4, 2025	GPU, Video Understanding	Unverified
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation	Jul 13, 2023	Action Recognition, Contrastive Learning	Unverified
Is Temporal Prompting All We Need For Limited Labeled Action Recognition?	Apr 2, 2025	Action Recognition, All	Unverified
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model	Jan 21, 2025	Instruction Following, Mathematical Reasoning	Unverified
DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition	Jan 11, 2019	Action Classification, Action Recognition	Unverified
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	Jul 3, 2024	Articles, Image Comprehension	Unverified
Jointly Learning Energy Expenditures and Activities Using Egocentric Multimodal Signals	Jul 1, 2017	Video Understanding	Unverified
Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection	Dec 6, 2024	GPU, Multi-Object Tracking	Unverified
Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input	Aug 28, 2024	Language Modeling, Language Modelling	Unverified
DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning	Aug 29, 2024	Multi-Task Learning, Prompt Learning	Unverified
KeyVideoLLM: Towards Large-scale Video Keyframe Selection	Jul 3, 2024	Data Compression, Management	Unverified
Integrated Object Detection and Tracking with Tracklet-Conditioned Detection	Nov 27, 2018	Object, object-detection	Unverified
KnowIT VQA: Answering Knowledge-Based Questions about Videos	Oct 23, 2019	Question Answering, Video Question Answering	Unverified
Knowledge-Based Visual Question Answering in Videos	Apr 17, 2020	Question Answering, Video Question Answering	Unverified
Koala: Key frame-conditioned long video-LLM	Apr 5, 2024	Action Recognition, Question Answering	Unverified
Instrument-tissue Interaction Detection Framework for Surgical Video Understanding	Mar 30, 2024	Video Understanding	Unverified
InstructionBench: An Instructional Video Understanding Benchmark	Apr 7, 2025	Common Sense Reasoning, Multiple-choice	Unverified
Distantly Supervised Semantic Text Detection and Recognition for Broadcast Sports Videos Understanding	Oct 31, 2021	Action Recognition, Text Detection	Unverified
AVT: Audio-Video Transformer for Multimodal Action Recognition	Sep 22, 2022	Action Recognition, Audio Classification	Unverified
Aligned Better, Listen Better for Audio-Visual Large Language Models	Apr 2, 2025	Video Understanding	Unverified
Disentangle and denoise: Tackling context misalignment for video moment retrieval	Aug 14, 2024	Denoising, Disentanglement	Unverified
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding	Jun 18, 2025	GPU, Streaming video understanding	Unverified
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs	Jun 5, 2025	Benchmarking, Video Understanding	Unverified
Large-Scale Video Classification with Feature Space Augmentation coupled with Learned Label Relations and Ensembling	Sep 21, 2018	General Classification, Video Classification	Unverified
Large Scale Video Representation Learning via Relational Graph Clustering	Jun 1, 2020	Clustering, Graph Clustering	Unverified
Large-Scale YouTube-8M Video Understanding with Deep Neural Networks	Jun 14, 2017	Classification, General Classification	Unverified
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision	Apr 15, 2023	Language Modeling, Language Modelling	Unverified
Dynamic Appearance: A Video Representation for Action Recognition with Joint Training	Nov 23, 2022	Action Recognition, Temporal Action Localization	Unverified
Beyond the Camera: Neural Networks in World Coordinates	Mar 12, 2020	Action Recognition, Video Stabilization	Unverified
Dynamic Graph Modules for Modeling Object-Object Interactions in Activity Recognition	Dec 13, 2018	3D Action Recognition, Action Recognition	Unverified
Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection	Aug 8, 2021	Action Detection, Knowledge Distillation	Unverified
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval	Apr 3, 2025	Information Retrieval, Representation Learning	Unverified
Learning Dynamic MRI Reconstruction with Convolutional Network Assisted Reconstruction Swin Transformer	Sep 19, 2023	Anatomy, Computational Efficiency	Unverified
Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking	Jun 7, 2021	Graph Neural Network, Multi-Person Pose Estimation	Unverified
Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment	Jun 8, 2023	Video Understanding	Unverified
Learning from Multiple Sources for Video Summarisation	Jan 13, 2015	Clustering, Video Understanding	Unverified
Learning Higher-order Object Interactions for Keypoint-based Video Understanding	May 16, 2023	Action Localization, Action Recognition	Unverified
Inductive Attention for Video Action Anticipation	Dec 17, 2022	Action Anticipation, Action Recognition	Unverified
Discrete neural representations for explainable anomaly detection	Dec 10, 2021	Anomaly Detection, Object	Unverified
Improving Video Model Transfer With Dynamic Representation Learning	Jan 1, 2022	Action Classification, Knowledge Distillation	Unverified
Improving LLM Video Understanding with 16 Frames Per Second	Mar 18, 2025	MME, Video MME	Unverified
