Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 451–500 of 1149 papers

| Title | Status | Hype |
| --- | --- | --- |
| DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering | | 0 |
| InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | | 0 |
| DOAD: Decoupled One Stage Action Detection Network | | 0 |
| In-the-Wild Video Question Answering | | 0 |
| BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation | | 0 |
| IPAD: Industrial Process Anomaly Detection Dataset | | 0 |
| ALLVB: All-in-One Long Video Understanding Benchmark | | 0 |
| IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs | | 0 |
| Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory | | 0 |
| LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models | | 0 |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | | 0 |
| Is Temporal Prompting All We Need For Limited Labeled Action Recognition? | | 0 |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | | 0 |
| DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition | | 0 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | | 0 |
| Jointly Learning Energy Expenditures and Activities Using Egocentric Multimodal Signals | | 0 |
| Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection | | 0 |
| Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input | | 0 |
| DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning | | 0 |
| KeyVideoLLM: Towards Large-scale Video Keyframe Selection | | 0 |
| Integrated Object Detection and Tracking with Tracklet-Conditioned Detection | | 0 |
| KnowIT VQA: Answering Knowledge-Based Questions about Videos | | 0 |
| Knowledge-Based Visual Question Answering in Videos | | 0 |
| Koala: Key frame-conditioned long video-LLM | | 0 |
| Instrument-tissue Interaction Detection Framework for Surgical Video Understanding | | 0 |
| InstructionBench: An Instructional Video Understanding Benchmark | | 0 |
| Distantly Supervised Semantic Text Detection and Recognition for Broadcast Sports Videos Understanding | | 0 |
| AVT: Audio-Video Transformer for Multimodal Action Recognition | | 0 |
| Aligned Better, Listen Better for Audio-Visual Large Language Models | | 0 |
| Disentangle and denoise: Tackling context misalignment for video moment retrieval | | 0 |
| InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding | | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | | 0 |
| Large-Scale Video Classification with Feature Space Augmentation coupled with Learned Label Relations and Ensembling | | 0 |
| Large Scale Video Representation Learning via Relational Graph Clustering | | 0 |
| Large-Scale YouTube-8M Video Understanding with Deep Neural Networks | | 0 |
| LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision | | 0 |
| Dynamic Appearance: A Video Representation for Action Recognition with Joint Training | | 0 |
| Beyond the Camera: Neural Networks in World Coordinates | | 0 |
| Dynamic Graph Modules for Modeling Object-Object Interactions in Activity Recognition | | 0 |
| Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection | | 0 |
| Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | | 0 |
| Learning Dynamic MRI Reconstruction with Convolutional Network Assisted Reconstruction Swin Transformer | | 0 |
| Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking | | 0 |
| Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment | | 0 |
| Learning from Multiple Sources for Video Summarisation | | 0 |
| Learning Higher-order Object Interactions for Keypoint-based Video Understanding | | 0 |
| Inductive Attention for Video Action Anticipation | | 0 |
| Discrete neural representations for explainable anomaly detection | | 0 |
| Improving Video Model Transfer With Dynamic Representation Learning | | 0 |
| Improving LLM Video Understanding with 16 Frames Per Second | | 0 |

No leaderboard results yet.