SOTAVerified

Video Question Answering

Papers

Showing 1-10 of 460 papers

| Title | Status | Hype |
|---|---|---|
| Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Code | 1 |
| LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs | Code | 2 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | | 0 |
| video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Code | 2 |
| CogStream: Context-guided Streaming Video Question Answering | | 0 |
| CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models | Code | 2 |
| V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | Code | 7 |
| Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning | Code | 0 |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | Code | 0 |
| VUDG: A Dataset for Video Understanding Domain Generalization | | 0 |

Benchmark Results

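| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | Accuracy | 61.2 | | Unverified |
| 2 | GPT-2 + CLIP-32 (Zero-Shot) | Accuracy | 58.4 | | Unverified |
| 3 | VideoCoCa | Accuracy | 56.1 | | Unverified |
| 4 | Mirasol3B | Accuracy | 51.13 | | Unverified |
| 5 | VAST | Accuracy | 50.4 | | Unverified |
| 6 | COSA | Accuracy | 49.9 | | Unverified |
| 7 | MA-LMM | Accuracy | 49.8 | | Unverified |
| 8 | VideoChat2 | Accuracy | 49.1 | | Unverified |
| 9 | VALOR | Accuracy | 48.6 | | Unverified |
| 10 | UMT-L (ViT-L/16) | Accuracy | 47.9 | | Unverified |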