SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 501–550 of 1149 papers

Title | Status | Hype
Inductive Attention for Video Action Anticipation | | 0
Discrete neural representations for explainable anomaly detection | | 0
Improving Video Model Transfer With Dynamic Representation Learning | | 0
Improving LLM Video Understanding with 16 Frames Per Second | | 0
Discerning Generic Event Boundaries in Long-Form Wild Videos | | 0
Action Understanding with Multiple Classes of Actors | | 0
Impossible Videos | | 0
Learning to Focus on the Foreground for Temporal Sentence Grounding | | 0
iMOVE: Instance-Motion-Aware Video Understanding | | 0
Identity-aware Graph Memory Network for Action Detection | | 0
Identifying Auxiliary or Adversarial Tasks Using Necessary Condition Analysis for Adversarial Multi-task Video Understanding | | 0
AirLetters: An Open Video Dataset of Characters Drawn in the Air | | 0
Action Sensitivity Learning for Temporal Action Localization | | 0
i-Code: An Integrative and Composable Multimodal Learning Framework | | 0
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding | | 0
Development of a MultiModal Annotation Framework and Dataset for Deep Video Understanding | | 0
HuMoCon: Concept Discovery for Human Motion Understanding | | 0
HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data | | 0
Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection | | 0
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding | | 0
FE-Adapter: Adapting Image-based Emotion Classifiers to Videos | | 0
MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | | 0
DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding | | 0
AVD2: Accident Video Diffusion for Accident Video Description | | 0
How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos? | | 0
How to Make a BLT Sandwich? Learning to Reason towards Understanding Web Instructional Videos | | 0
Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network | | 0
MM-Ego: Towards Building Egocentric Multimodal LLMs | | 0
How Can Objects Help Video-Language Understanding? | | 0
H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving | | 0
HLVU : A New Challenge to Test Deep Understanding of Movies the Way Humans do | | 0
Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis | | 0
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding | | 0
Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural Architecture Search | | 0
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | | 0
HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding | | 0
Deep Spatio-Temporal Random Fields for Efficient Video Segmentation | | 0
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding | | 0
Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training | | 0
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | | 0
Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions | | 0
HFGCN: Hypergraph Fusion Graph Convolutional Networks for Skeleton-Based Action Recognition | | 0
Deep learning for action spotting in association football videos | | 0
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | | 0
DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description | | 0
Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions | | 0
Cycle-Contrast for Self-Supervised Video Representation Learning | | 0
A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset | | 0
HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions | | 0
Aggregating Frame-level Features for Large-Scale Video Classification | | 0
Page 11 of 23

No leaderboard results yet.