Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 1051–1100 of 1149 papers

Title	Date	Tasks	Status
Recurring the Transformer for Video Action Recognition	Jan 1, 2022	Action Recognition, GPU	Unverified
Relational Space-Time Query in Long-Form Videos	Jan 1, 2023	Form, Video Understanding	Unverified
Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition	Aug 3, 2020	Action Recognition, Optical Flow Estimation	Unverified
ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task	Apr 20, 2025	Language Modeling, Language Modelling	Unverified
Rethinking Image-to-Video Adaptation: An Object-centric Perspective	Jul 9, 2024	Action Recognition, Object	Unverified
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data	Jul 18, 2024	Language Modelling, Large Language Model	Unverified
Retrieval-based Video Language Model for Efficient Long Video Question Answering	Dec 8, 2023	Language Modeling, Language Modelling	Unverified
RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning	May 11, 2024	Image-text matching, Retrieval	Unverified
Revealing Occlusions with 4D Neural Fields	Apr 22, 2022	Video Understanding	Unverified
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding	Sep 20, 2023	Action Localization, Form	Unverified
ReWind: Understanding Long Videos with Instructed Learnable Memory	Nov 23, 2024	Large Language Model, Question Answering	Unverified
SA-NET.v2: Real-time vehicle detection from oblique UAV images with use of uncertainty estimation in deep meta-learning	Aug 4, 2022	Meta-Learning, Semantic Segmentation	Unverified
SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context	Nov 25, 2024	Large Language Model, MME	Unverified
Scene-centric Joint Parsing of Cross-view Videos	Sep 16, 2017	Video Understanding	Unverified
Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis	May 31, 2025	Scene Segmentation, Segmentation	Unverified
SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding	Jun 9, 2025	RAG, Retrieval	Unverified
MM-SEAL: A Large-scale Video Dataset of Multi-person Multi-grained Spatio-temporally Action Localization	Apr 6, 2022	Action Localization, Action Recognition	Unverified
SEAL: Semantic Attention Learning for Long Video Representation	Dec 2, 2024	Diversity, Question Answering	Unverified
Search-Map-Search: A Frame Selection Paradigm for Action Recognition	Apr 20, 2023	Action Recognition, Heuristic Search	Unverified
Seed1.5-VL Technical Report	May 11, 2025	Mixture-of-Experts, Multimodal Reasoning	Unverified
Selective Structured State-Spaces for Long-Form Video Understanding	Mar 25, 2023	Contrastive Learning, Form	Unverified
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization	Apr 16, 2025	Hallucination, Question Answering	Unverified
Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding	Mar 26, 2025	GPU, Question Answering	Unverified
Self-supervised Motion Representation via Scattering Local Motion Cues	Aug 1, 2020	Action Recognition, Optical Flow Estimation	Unverified
Self-Supervised Object Detection from Egocentric Videos	Jan 1, 2023	Class-agnostic Object Detection, Object	Unverified
Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction	Nov 28, 2018	Action Recognition, Prediction	Unverified
Self-supervised video pretraining yields robust and more human-aligned visual representations	Oct 12, 2022	Contrastive Learning, object-detection	Unverified
Semantics-aware Test-time Adaptation for 3D Human Pose Estimation	Feb 15, 2025	3D human pose and shape estimation, 3D Human Pose Estimation	Unverified
Semantic Segmentation on VSPW Dataset through Masked Video Consistency	Jun 7, 2024	Semantic Segmentation, Video Understanding	Unverified
Semi-Parametric Video-Grounded Text Generation	Jan 27, 2023	Language Modeling, Language Modelling	Unverified
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding	Apr 10, 2025	Video Understanding	Unverified
ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries	Dec 17, 2024	Human Detection, image-classification	Unverified
SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation	May 13, 2025	Computational Efficiency, Video Understanding	Unverified
Skimming and Scanning for Untrimmed Video Action Recognition	Apr 21, 2021	Action Recognition, Temporal Action Localization	Unverified
Slicing Convolutional Neural Network for Crowd Video Understanding	Jun 1, 2016	Attribute, Video Understanding	Unverified
Slot-VLM: SlowFast Slots for Video-Language Modeling	Feb 20, 2024	Language Modeling, Language Modelling	Unverified
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding	Mar 24, 2025	Form, Video Understanding	Unverified
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability	Mar 18, 2025	Language Modeling, Language Modelling	Unverified
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding	Nov 30, 2023	Form, Video Retrieval	Unverified
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs	May 25, 2025	Video Understanding	Unverified
Spatio-Temporal Context for Action Detection	Jun 29, 2021	Action Detection, Video Understanding	Unverified
Spatio-Temporal Crop Aggregation for Video Representation Learning	Nov 30, 2022	Action Classification, Dimensionality Reduction	Unverified
Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos	Jul 1, 2017	Action Recognition, Action Recognition In Videos	Unverified
Spatio-Temporal Video Representation Learning for AI Based Video Playback Style Prediction	Oct 3, 2021	Action Recognition, Representation Learning	Unverified
Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain	Sep 29, 2022	Action Recognition, Video Understanding	Unverified
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos	Aug 9, 2024	Active Speaker Localization, Decoder	Unverified
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions	May 10, 2021	Contrastive Learning, Retrieval	Unverified
SPOT! Revisiting Video-Language Models for Event Understanding	Nov 21, 2023	Attribute, Video Understanding	Unverified
Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips	Dec 2, 2021	Action Recognition, Video Understanding	Unverified
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training	Nov 29, 2024	Question Answering, Video Understanding	Unverified
