Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 701–750 of 1149 papers

Title	Date	Tasks	Status
C^3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues	Jun 16, 2021	Contrastive Learningcounterfactual	—Unverified
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition	Mar 30, 2025	Action ClassificationAction Recognition	—Unverified
CAG-QIL: Context-Aware Actionness Grouping via Q Imitation Learning for Online Temporal Action Localization	Jan 1, 2021	Action LocalizationImitation Learning	—Unverified
Camera Calibration and Player Localization in SoccerNet-v2 and Investigation of their Representations for Action Spotting	Apr 19, 2021	Action SpottingCamera Calibration	—Unverified
Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP	Sep 23, 2024	Image GenerationQuestion Answering	—Unverified
FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning	Oct 20, 2024	DiagnosticVideo Captioning	—Unverified
Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks?	Nov 13, 2024	Action LocalizationTemporal Action Localization	—Unverified
Can Temporal Information Help with Contrastive Self-Supervised Learning?	Nov 25, 2020	Data AugmentationRepresentation Learning	—Unverified
Can't Fool Me: Adversarially Robust Transformer for Video Understanding	Oct 26, 2021	image-classificationImage Classification	—Unverified
CATER: A diagnostic dataset for Compositional Actions & TEmporal Reasoning	May 1, 2020	DiagnosticObject	—Unverified
Causal Reasoning Meets Visual Representation Learning: A Prospective Study	Apr 26, 2022	BenchmarkingOut-of-Distribution Generalization	—Unverified
CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs	Jul 1, 2025	Text GenerationVideo Understanding	—Unverified
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding	Dec 16, 2024	HallucinationMultiple-choice	—Unverified
Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis	May 14, 2024	4kGPU	—Unverified
Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos	Apr 25, 2018	General ClassificationVideo Classification	—Unverified
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System	Apr 27, 2023	Video Understanding	—Unverified
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI	Jul 14, 2025	Large Language ModelMultimodal Large Language Model	—Unverified
CinePile: A Long Video Question Answering Dataset and Benchmark	May 14, 2024	FormHuman-Object Interaction Detection	—Unverified
Clapper: Compact Learning and Video Representation in VLMs	May 21, 2025	Video Understanding	—Unverified
ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation	Mar 19, 2021	ObjectReferring Expression Segmentation	—Unverified
CLIP4Caption: CLIP for Video Caption	Oct 13, 2021	DecoderSentence	—Unverified
Co-attentional Transformers for Story-Based Video Understanding	Oct 27, 2020	Question AnsweringVideo Question Answering	—Unverified
COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework	Dec 11, 2024	GPULanguage Modeling	—Unverified
CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding	Jul 21, 2021	Question AnsweringSentence	—Unverified
Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization	Mar 22, 2025	Saliency DetectionSentence	—Unverified
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs	May 6, 2024	Autonomous VehiclesVideo Understanding	—Unverified
Comprehensive Video Understanding: Video summarization with content-based video recommender design	Oct 30, 2019	Action RecognitionData Augmentation	—Unverified
Compressed Vision for Efficient Video Understanding	Oct 6, 2022	Video CompressionVideo Understanding	—Unverified
Concept Graph Neural Networks for Surgical Video Understanding	Feb 27, 2022	Video Understanding	—Unverified
Constructing Hierarchical Q&A Datasets for Video Story Understanding	Apr 1, 2019	Video Understanding	—Unverified
ContextDet: Temporal Action Detection with Adaptive Context Aggregation	Oct 20, 2024	Action DetectionVideo Understanding	—Unverified
Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries	Apr 3, 2020	Referring Expression SegmentationVideo Segmentation	—Unverified
Contrastive Language-Action Pre-training for Temporal Localization	Apr 26, 2022	Action LocalizationContrastive Learning	—Unverified
Contrastive Language Video Time Pre-training	Jun 4, 2024	Action RecognitionContrastive Learning	—Unverified
CoS: Chain-of-Shot Prompting for Long Video Understanding	Feb 10, 2025	Video Understanding	—Unverified
CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos	Mar 24, 2025	Anomaly DetectionAnomaly Detection In Surveillance Videos	—Unverified
Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization	Jan 1, 2021	Action LocalizationVideo Understanding	—Unverified
Cross-Class Relevance Learning for Temporal Concept Localization	Nov 19, 2019	Feature EngineeringVideo Understanding	—Unverified
CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding	Jan 17, 2024	Contrastive Learningpoint cloud video understanding	—Unverified
CTM: Collaborative Temporal Modeling for Action Recognition	Feb 8, 2020	Action RecognitionVideo Understanding	—Unverified
Cultivating DNN Diversity for Large Scale Video Labelling	Jul 13, 2017	DiversityVideo Understanding	—Unverified
Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data	Jan 17, 2020	Graph LearningVideo Understanding	—Unverified
Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model	Jan 29, 2024	Action DetectionAction Localization	—Unverified
Cycle-Contrast for Self-Supervised Video Representation Learning	Oct 28, 2020	Action RecognitionContrastive Learning	—Unverified
DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description	Mar 31, 2025	Video DescriptionVideo Understanding	—Unverified
Deep learning for action spotting in association football videos	Oct 2, 2024	Action SpottingBenchmarking	—Unverified
Deep Spatio-Temporal Random Fields for Efficient Video Segmentation	Jul 3, 2018	Instance SegmentationSemantic Segmentation	—Unverified
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding	May 23, 2025	FormQuestion Answering	—Unverified
DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding	May 19, 2018	Action Recognition In VideosGesture Recognition	—Unverified
Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection	Jul 29, 2020	object-detectionObject Detection	—Unverified

Show:10 25 50

← PrevPage 15 of 23Next →

No leaderboard results yet.