SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 1051–1100 of 1149 papers

Title | Status | Hype
Recurring the Transformer for Video Action Recognition | | 0
Relational Space-Time Query in Long-Form Videos | | 0
Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition | | 0
ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task | | 0
Rethinking Image-to-Video Adaptation: An Object-centric Perspective | | 0
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data | | 0
Retrieval-based Video Language Model for Efficient Long Video Question Answering | | 0
RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | | 0
Revealing Occlusions with 4D Neural Fields | | 0
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding | | 0
ReWind: Understanding Long Videos with Instructed Learnable Memory | | 0
SA-NET.v2: Real-time vehicle detection from oblique UAV images with use of uncertainty estimation in deep meta-learning | | 0
SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context | | 0
Scene-centric Joint Parsing of Cross-view Videos | | 0
Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis | | 0
SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding | | 0
MM-SEAL: A Large-scale Video Dataset of Multi-person Multi-grained Spatio-temporally Action Localization | | 0
SEAL: Semantic Attention Learning for Long Video Representation | | 0
Search-Map-Search: A Frame Selection Paradigm for Action Recognition | | 0
Seed1.5-VL Technical Report | | 0
Selective Structured State-Spaces for Long-Form Video Understanding | | 0
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | | 0
Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding | | 0
Self-supervised Motion Representation via Scattering Local Motion Cues | | 0
Self-Supervised Object Detection from Egocentric Videos | | 0
Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction | | 0
Self-supervised video pretraining yields robust and more human-aligned visual representations | | 0
Semantics-aware Test-time Adaptation for 3D Human Pose Estimation | | 0
Semantic Segmentation on VSPW Dataset through Masked Video Consistency | | 0
Semi-Parametric Video-Grounded Text Generation | | 0
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding | | 0
ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries | | 0
SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation | | 0
Skimming and Scanning for Untrimmed Video Action Recognition | | 0
Slicing Convolutional Neural Network for Crowd Video Understanding | | 0
Slot-VLM: SlowFast Slots for Video-Language Modeling | | 0
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding | | 0
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability | | 0
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding | | 0
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs | | 0
Spatio-Temporal Context for Action Detection | | 0
Spatio-Temporal Crop Aggregation for Video Representation Learning | | 0
Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos | | 0
Spatio-Temporal Video Representation Learning for AI Based Video Playback Style Prediction | | 0
Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain | | 0
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos | | 0
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions | | 0
SPOT! Revisiting Video-Language Models for Event Understanding | | 0
Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips | | 0
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training | | 0

No leaderboard results yet.