SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 801–825 of 1149 papers

Title | Status | Hype
Vamos: Versatile Action Models for Video Understanding | Code | 0
SPOT! Revisiting Video-Language Models for Event Understanding | | 0
ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab | | 0
ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection | | 0
Beyond still images: Temporal features and input variance resilience | | 0
Videoprompter: an ensemble of foundational models for zero-shot video understanding | | 0
Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding | | 0
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks | | 0
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | | 0
Telling Stories for Common Sense Zero-Shot Action Recognition | Code | 0
M^{3}3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding | | 0
Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges | | 0
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding | | 0
Learning Dynamic MRI Reconstruction with Convolutional Network Assisted Reconstruction Swin Transformer | | 0
Language as the Medium: Multimodal Video Classification through text only | | 0
Judging a video by its bitstream cover | Code | 0
Motion-Guided Masking for Spatiotemporal Representation Learning | | 0
MOFO: MOtion FOcused Self-Supervision for Video Understanding | Code | 0
Are current long-term video understanding datasets long-term? | Code | 0
Audio-Visual Glance Network for Efficient Video Recognition | | 0
Temporally-Adaptive Models for Efficient Video Understanding | | 0
M^3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition | | 0
DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action Segmentation | | 0
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | | 0
HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding | Code | 0

No leaderboard results yet.