SOTAVerified

Video Question Answering

Papers

Showing 1-10 of 460 papers

Title | Status | Hype
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Code | 1
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs | Code | 2
How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | | 0
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Code | 2
CogStream: Context-guided Streaming Video Question Answering | | 0
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models | Code | 2
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | Code | 7
Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning | Code | 0
EgoVLM: Policy Optimization for Egocentric Video Understanding | Code | 0
VUDG: A Dataset for Video Understanding Domain Generalization | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Text + Text (no Multimodal Pretext Training) | Accuracy | 40.2 | | Unverified
2 | FrozenBiLM | Accuracy | 39.6 | | Unverified
3 | VideoCoCa | Accuracy | 39 | | Unverified
4 | Co-Tokenization | Accuracy | 38.2 | | Unverified
5 | Just Ask (fine-tune) | Accuracy | 35.4 | | Unverified
6 | FrozenBiLM (0-shot) | Accuracy | 26.8 | | Unverified
7 | Just Ask (0-shot) | Accuracy | 12.2 | | Unverified
8 | FrozenBiLM | Accuracy | 0.27 | | Unverified