SOTAVerified

Video Understanding

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Papers

Showing 401–425 of 1149 papers

| Title | Status | Hype |
| --- | --- | --- |
| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Code | 1 |
| EAGLE: Egocentric AGgregated Language-video Engine | | 0 |
| E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding | Code | 2 |
| LLM4Brain: Training a Large Language Model for Brain Video Understanding | | 0 |
| Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP | | 0 |
| Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding | Code | 4 |
| Towards Child-Inclusive Clinical Video Understanding for Autism Spectrum Disorder | | 0 |
| First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge | | 0 |
| Interpretable Action Recognition on Hard to Classify Actions | | 0 |
| AMEGO: Active Memory from long EGOcentric videos | | 0 |
| HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions | | 0 |
| SoccerNet 2024 Challenges Results | Code | 0 |
| Enhancing Long Video Understanding via Hierarchical Event-Based Memory | | 0 |
| VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery | | 0 |
| TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations | | 0 |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture | Code | 3 |
| VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges | | 0 |
| StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models | | 0 |
| Streamlining Forest Wildfire Surveillance: AI-Enhanced UAVs Utilizing the FLAME Aerial Video Dataset for Lightweight and Efficient Monitoring | | 0 |
| DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning | | 0 |
| CogVLM2: Visual Language Models for Image and Video Understanding | Code | 9 |
| Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input | | 0 |
| Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos | Code | 2 |
| Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification | | 0 |
| LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models | Code | 0 |

No leaderboard results yet.