SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 501-550 of 1149 papers

Title | Status | Hype
Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Code | 0
A Coding Framework and Benchmark towards Low-Bitrate Video Understanding | Code | 0
Are you Struggling? Dataset and Baselines for Struggle Determination in Assembly Videos | Code | 0
EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization | Code | 0
Pooled Motion Features for First-Person Videos | Code | 0
CARPe Posterum: A Convolutional Approach for Real-time Pedestrian Path Prediction | Code | 0
Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition | Code | 0
Are current long-term video understanding datasets long-term? | Code | 0
Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark | Code | 0
Enhancing Temporal Modeling of Video LLMs via Time Gating | Code | 0
On the Pitfalls of Batch Normalization for End-to-End Video Learning: A Study on Surgical Workflow Analysis | Code | 0
OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions | Code | 0
End-to-End Learning of Motion Representation for Video Understanding | Code | 0
NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels | Code | 0
NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification | Code | 0
Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding | Code | 0
B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens | Code | 0
Long-Term Feature Banks for Detailed Video Understanding | Code | 0
EgoVLM: Policy Optimization for Egocentric Video Understanding | Code | 0
Multi-attention Networks for Temporal Localization of Video-level Labels | Code | 0
MOFO: MOtion FOcused Self-Supervision for Video Understanding | Code | 0
Localizing Moments in Video with Temporal Language | Code | 0
LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models | Code | 0
MOD: A Deep Mixture Model with Online Knowledge Distillation for Large Scale Video Temporal Concept Localization | Code | 0
Multimodal Dialogue State Tracking | Code | 0
Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision | Code | 0
LLaVA-OneVision: Easy Visual Task Transfer | Code | 0
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding | Code | 0
MINOTAUR: Multi-task Video Grounding From Multimodal Queries | Code | 0
Masked Autoencoders for Egocentric Video Understanding @ Ego4D Challenge 2022 | Code | 0
A Challenge to Build Neuro-Symbolic Video Agents | Code | 0
Representation Flow for Action Recognition | Code | 0
Learning to Visually Connect Actions and their Effects | | 0
Learning to Focus on the Foreground for Temporal Sentence Grounding | | 0
Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey | | 0
Learning text-to-video retrieval from image captioning | | 0
Learning Space-Time Semantic Correspondences | | 0
An Effective Way to Improve YouTube-8M Classification Accuracy in Google Cloud Platform | | 0
Learning reusable concepts across different egocentric video understanding tasks | | 0
EAGLE: Egocentric AGgregated Language-video Engine | | 0
Learning Object State Changes in Videos: An Open-World Perspective | | 0
Learning Higher-order Object Interactions for Keypoint-based Video Understanding | | 0
Learning from Multiple Sources for Video Summarisation | | 0
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding | | 0
BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes | | 0
An Attempt towards Interpretable Audio-Visual Video Captioning | | 0
AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | | 0
Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment | | 0
Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking | | 0
DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding | | 0

No leaderboard results yet.