SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 701-750 of 1149 papers

| Title | Status | Hype |
| --- | --- | --- |
| C^3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues | | 0 |
| CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | | 0 |
| CAG-QIL: Context-Aware Actionness Grouping via Q Imitation Learning for Online Temporal Action Localization | | 0 |
| Camera Calibration and Player Localization in SoccerNet-v2 and Investigation of their Representations for Action Spotting | | 0 |
| Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP | | 0 |
| FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning | | 0 |
| Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks? | | 0 |
| Can Temporal Information Help with Contrastive Self-Supervised Learning? | | 0 |
| Can't Fool Me: Adversarially Robust Transformer for Video Understanding | | 0 |
| CATER: A diagnostic dataset for Compositional Actions & TEmporal Reasoning | | 0 |
| Causal Reasoning Meets Visual Representation Learning: A Prospective Study | | 0 |
| CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs | | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | | 0 |
| Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis | | 0 |
| Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos | | 0 |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | | 0 |
| Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI | | 0 |
| CinePile: A Long Video Question Answering Dataset and Benchmark | | 0 |
| Clapper: Compact Learning and Video Representation in VLMs | | 0 |
| ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation | | 0 |
| CLIP4Caption: CLIP for Video Caption | | 0 |
| Co-attentional Transformers for Story-Based Video Understanding | | 0 |
| COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework | | 0 |
| CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding | | 0 |
| Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization | | 0 |
| How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs | | 0 |
| Comprehensive Video Understanding: Video summarization with content-based video recommender design | | 0 |
| Compressed Vision for Efficient Video Understanding | | 0 |
| Concept Graph Neural Networks for Surgical Video Understanding | | 0 |
| Constructing Hierarchical Q&A Datasets for Video Story Understanding | | 0 |
| ContextDet: Temporal Action Detection with Adaptive Context Aggregation | | 0 |
| Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries | | 0 |
| Contrastive Language-Action Pre-training for Temporal Localization | | 0 |
| Contrastive Language Video Time Pre-training | | 0 |
| CoS: Chain-of-Shot Prompting for Long Video Understanding | | 0 |
| CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos | | 0 |
| Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization | | 0 |
| Cross-Class Relevance Learning for Temporal Concept Localization | | 0 |
| CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding | | 0 |
| CTM: Collaborative Temporal Modeling for Action Recognition | | 0 |
| Cultivating DNN Diversity for Large Scale Video Labelling | | 0 |
| Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data | | 0 |
| Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model | | 0 |
| Cycle-Contrast for Self-Supervised Video Representation Learning | | 0 |
| DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description | | 0 |
| Deep learning for action spotting in association football videos | | 0 |
| Deep Spatio-Temporal Random Fields for Efficient Video Segmentation | | 0 |
| Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding | | 0 |
| DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding | | 0 |
| Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection | | 0 |
