SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 1001-1050 of 1149 papers

| Title | Status | Hype |
| --- | --- | --- |
| What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation? | | 0 |
| What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets | | 0 |
| When Work Matters: Transforming Classical Network Structures to Graph CNN | | 0 |
| WildQA: In-the-Wild Video Question Answering | | 0 |
| Wolf: Captioning Everything with a World Summarization Framework | | 0 |
| WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning | | 0 |
| WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs | | 0 |
| X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding | | 0 |
| YouMVOS: An Actor-Centric Multi-Shot Video Object Segmentation Dataset | | 0 |
| YouTube-8M Video Understanding Challenge Approach and Applications | | 0 |
| ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection | | 0 |
| Zero-shot Action Localization via the Confidence of Large Vision-Language Models | | 0 |
| Zero-Shot Action Recognition in Surveillance Videos | | 0 |
| Zero-Shot Action Recognition in Videos: A Survey | | 0 |
| Zero-Shot Long-Form Video Understanding through Screenplay | | 0 |
| Zero-shot Shark Tracking and Biometrics from Aerial Imagery | | 0 |
| Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network | | 0 |
| 4D Generic Video Object Proposals | Code | 0 |
| LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models | Code | 0 |
| LLaVA-OneVision: Easy Visual Task Transfer | Code | 0 |
| A Context-Aware Loss Function for Action Spotting in Soccer Videos | Code | 0 |
| Learnable pooling with Context Gating for video classification | Code | 0 |
| Learnable Pooling Methods for Video Classification | Code | 0 |
| Leaping Into Memories: Space-Time Deep Feature Synthesis | Code | 0 |
| Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing | Code | 0 |
| Judging a video by its bitstream cover | Code | 0 |
| CARPe Posterum: A Convolutional Approach for Real-time Pedestrian Path Prediction | Code | 0 |
| VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model | Code | 0 |
| Joint Event Detection and Description in Continuous Video Streams | Code | 0 |
| Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning | Code | 0 |
| Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition | Code | 0 |
| B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens | Code | 0 |
| In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition | Code | 0 |
| ViP: Video Platform for PyTorch | Code | 0 |
| ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation | Code | 0 |
| Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision | Code | 0 |
| ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models | Code | 0 |
| https://arxiv.org/abs/2407.00634 | Code | 0 |
| How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios | Code | 0 |
| HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios | Code | 0 |
| Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations | Code | 0 |
| HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding | Code | 0 |
| The Visual Centrifuge: Model-Free Layered Video Representations | Code | 0 |
| The YouTube-8M Kaggle Competition: Challenges and Methods | Code | 0 |
| Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model | Code | 0 |
| The Monkeytyping Solution to the YouTube-8M Video Understanding Challenge | Code | 0 |
| Hierarchical Deep Recurrent Architecture for Video Understanding | Code | 0 |
| Temporal Tessellation: A Unified Approach for Video Analysis | Code | 0 |
| Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding | Code | 0 |
| Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding | Code | 0 |
Page 21 of 23

No leaderboard results yet.