SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 551-600 of 1149 papers

Title | Status | Hype
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment | | 0
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | | 0
VEU-Bench: Towards Comprehensive Understanding of Video Editing | | 0
ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning | | 0
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models | | 0
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation | | 0
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models | | 0
Video4MRI: An Empirical Study on Brain Magnetic Resonance Image Analytics with CNN-based Video Classification Frameworks | | 0
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | | 0
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | | 0
Perceptron Synthesis Network: Rethinking the Action Scale Variances in Videos | | 0
Video Domain Incremental Learning for Human Action Recognition in Home Environments | | 0
Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models | | 0
VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding | | 0
VideoGLUE: Video General Understanding Evaluation of Foundation Models | | 0
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | | 0
VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | | 0
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | | 0
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding | | 0
Video Language Model Pretraining with Spatio-temporal Masking | | 0
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges | | 0
VideoLLM Benchmarks and Evaluation: A Survey | | 0
VideoMCC: a New Benchmark for Video Comprehension | | 0
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | | 0
VideoPrism: A Foundational Visual Encoder for Video Understanding | | 0
Videoprompter: an ensemble of foundational models for zero-shot video understanding | | 0
Video Quality Assessment for Online Processing: From Spatial to Temporal Sampling | | 0
Video RWKV:Video Action Recognition Based RWKV | | 0
VideoSAVi: Self-Aligned Video Language Models without Human Supervision | | 0
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers | | 0
Video Swin Transformers for Egocentric Video Understanding @ Ego4D Challenges 2022 | | 0
Video Time: Properties, Encoders and Evaluation | | 0
Video Token Merging for Long-form Video Understanding | | 0
Video Understanding as Machine Translation | | 0
Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs | | 0
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding | | 0
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding | | 0
VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery | | 0
ViFi-ReID: A Two-Stream Vision-WiFi Multimodal Approach for Person Re-identification | | 0
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation | | 0
Visual Context Window Extension: A New Perspective for Long Video Understanding | | 0
Visual Subtitle Feature Enhanced Video Outline Generation | | 0
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | | 0
VRDFormer: End-to-End Video Visual Relation Detection With Transformers | | 0
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | | 0
VUDG: A Dataset for Video Understanding Domain Generalization | | 0
Wasserstein Dependency Measure for Representation Learning | | 0
Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding | | 0
Weakly Supervised Multiclass Video Segmentation | | 0
Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models | | 0
Page 12 of 23

No leaderboard results yet.