SOTAVerified

Video Question Answering

Papers

Showing 351–400 of 460 papers

| Title | Status | Hype |
|---|---|---|
| FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis |  | 0 |
| First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge |  | 0 |
| VITED: Video Temporal Evidence Distillation |  | 0 |
| A Review of Deep Learning for Video Captioning |  | 0 |
| Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework |  | 0 |
| Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering |  | 0 |
| GPT-4o System Card |  | 0 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation |  | 0 |
| Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering |  | 0 |
| Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding |  | 0 |
| HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering |  | 0 |
| VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |  | 0 |
| Video Dialog via Progressive Inference and Cross-Transformer |  | 0 |
| VideoDistill: Language-aware Vision Distillation for Video Question Answering |  | 0 |
| Hierarchical Conditional Relation Networks for Multimodal Video Question Answering |  | 0 |
| Hierarchical Memory for Long Video QA |  | 0 |
| Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering |  | 0 |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |  | 0 |
| Holistic Multi-modal Memory Network for Movie Question Answering |  | 0 |
| How Can Objects Help Video-Language Understanding? |  | 0 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? |  | 0 |
| HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation |  | 0 |
| HySTER: A Hybrid Spatio-Temporal Event Reasoner |  | 0 |
| In-the-Wild Video Question Answering |  | 0 |
| Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering |  | 0 |
| iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering |  | 0 |
| IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs |  | 0 |
| iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability |  | 0 |
| Is a Video worth n×n Images? A Highly Efficient Approach to Transformer-based Video Question Answering |  | 0 |
| Zero-Shot Video Question Answering with Procedural Programs |  | 0 |
| KeyVideoLLM: Towards Large-scale Video Keyframe Selection |  | 0 |
| Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering |  | 0 |
| KnowIT VQA: Answering Knowledge-Based Questions about Videos |  | 0 |
| Video Instruction Tuning With Synthetic Data |  | 0 |
| Knowledge-Based Visual Question Answering in Videos |  | 0 |
| Knowledge Proxy Intervention for Deconfounded Video Question Answering |  | 0 |
| Koala: Key frame-conditioned long video-LLM |  | 0 |
| Language-aware Visual Semantic Distillation for Video Question Answering |  | 0 |
| Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering |  | 0 |
| (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering |  | 0 |
| Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA |  | 0 |
| AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning |  | 0 |
| Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering |  | 0 |
| Learning Question-Guided Video Representation for Multi-Turn Video Question Answering |  | 0 |
| Adversarial Multimodal Network for Movie Question Answering |  | 0 |
| Advancing Egocentric Video Question Answering with Multimodal Large Language Models |  | 0 |
| Neural Reasoning, Fast and Slow, for Video Question Answering |  | 0 |
| Learning to Rehearse in Long Sequence Memorization |  | 0 |
| Learning Trajectory-Word Alignments for Video-Language Tasks |  | 0 |
| Leveraging Static Relationships for Intra-Type and Inter-Type Message Passing in Video Question Answering |  | 0 |

Benchmark Results

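| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVL-2.5(8B) | Accuracy | 85.5 |  | Unverified |
| 2 | LinVT-Qwen2-VL (7B) | Accuracy | 85.5 |  | Unverified |
| 3 | VideoLLaMA3(7B) | Accuracy | 84.5 |  | Unverified |
| 4 | PLM-8B | Accuracy | 84.1 |  | Unverified |
| 5 | BIMBA-LLaVA-Qwen2-7B | Accuracy | 83.73 |  | Unverified |
| 6 | PLM-3B | Accuracy | 83.4 |  | Unverified |
| 7 | LLaVA-Video | Accuracy | 83.2 |  | Unverified |
| 8 | NVILA(8B) | Accuracy | 82.2 |  | Unverified |
| 9 | Oryx-1.5(7B) | Accuracy | 81.8 |  | Unverified |
| 10 | Qwen2-VL(7B) | Accuracy | 81.2 |  | Unverified |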
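
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | Accuracy | 61.2 |  | Unverified |
| 2 | GPT-2 + CLIP-32 (Zero-Shot) | Accuracy | 58.4 |  | Unverified |
| 3 | VideoCoCa | Accuracy | 56.1 |  | Unverified |
| 4 | Mirasol3B | Accuracy | 51.13 |  | Unverified |
| 5 | VAST | Accuracy | 50.4 |  | Unverified |
| 6 | COSA | Accuracy | 49.9 |  | Unverified |
| 7 | MA-LMM | Accuracy | 49.8 |  | Unverified |
| 8 | VideoChat2 | Accuracy | 49.1 |  | Unverified |
| 9 | VALOR | Accuracy | 48.6 |  | Unverified |
| 10 | UMT-L (ViT-L/16) | Accuracy | 47.9 |  | Unverified |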

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Seed1.5-VL thinking | Average Accuracy | 63.6 |  | Unverified |
| 2 | PLM-8B | Average Accuracy | 63.5 |  | Unverified |
| 3 | Seed1.5-VL | Average Accuracy | 61.5 |  | Unverified |
| 4 | V-JEPA 2 ViT-g 8B | Average Accuracy | 60.6 |  | Unverified |
| 5 | PLM-3B | Average Accuracy | 58.9 |  | Unverified |
| 6 | RRPO | Average Accuracy | 56.5 |  | Unverified |
| 7 | Tarsier-34B | Average Accuracy | 55.5 |  | Unverified |
| 8 | Tarsier2-7B | Average Accuracy | 54.7 |  | Unverified |
| 9 | Qwen2-VL-72B | Average Accuracy | 52.7 |  | Unverified |
| 10 | IXC-2.5 7B | Average Accuracy | 51.6 |  | Unverified |
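
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LinVT-Qwen2-VL (7B) | Avg. | 69.3 |  | Unverified |
| 2 | Tarsier (34B) | Avg. | 67.6 |  | Unverified |
| 3 | InternVideo2 | Avg. | 67.2 |  | Unverified |
| 4 | LongVU (7B) | Avg. | 66.9 |  | Unverified |
| 5 | Oryx(34B) | Avg. | 64.7 |  | Unverified |
| 6 | VideoLLaMA2 (72B) | Avg. | 62 |  | Unverified |
| 7 | VideoChat-T (7B) | Avg. | 59.9 |  | Unverified |
| 8 | mPLUG-Owl3(7B) | Avg. | 59.5 |  | Unverified |
| 9 | PPLLaVA (7b) | Avg. | 59.2 |  | Unverified |
| 10 | VideoGPT+ | Avg. | 58.7 |  | Unverified |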
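
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Mirasol3B | Accuracy | 50.42 |  | Unverified |
| 2 | VAST | Accuracy | 50.1 |  | Unverified |
| 3 | COSA | Accuracy | 49.2 |  | Unverified |
| 4 | VALOR | Accuracy | 49.2 |  | Unverified |
| 5 | MA-LMM | Accuracy | 48.5 |  | Unverified |
| 6 | mPLUG-2 | Accuracy | 48 |  | Unverified |
| 7 | FrozenBiLM | Accuracy | 47 |  | Unverified |
| 8 | HBI | Accuracy | 46.2 |  | Unverified |
| 9 | EMCL-Net | Accuracy | 45.8 |  | Unverified |
| 10 | VindLU | Accuracy | 44.6 |  | Unverified |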

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VLAP (4 frames) | Average Accuracy | 67.1 |  | Unverified |
| 2 | LLaMA-VQA | Average Accuracy | 65.4 |  | Unverified |
| 3 | SeViLA | Average Accuracy | 64.9 |  | Unverified |
| 4 | InternVideo | Average Accuracy | 58.7 |  | Unverified |
| 5 | GF(sup) | Average Accuracy | 53.94 |  | Unverified |
| 6 | GF(uns) | Average Accuracy | 53.86 |  | Unverified |
| 7 | MIST | Average Accuracy | 51.13 |  | Unverified |
| 8 | Temp[ATP] | Average Accuracy | 48.37 |  | Unverified |
| 9 | AnyMAL-70B (0-shot) | Average Accuracy | 48.2 |  | Unverified |
| 10 | All-in-one | Average Accuracy | 47.5 |  | Unverified |
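
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Seed1.5-VL | AVG | 60 |  | Unverified |
| 2 | VideoChat-Online (4B) | AVG | 54.9 |  | Unverified |
| 3 | Gemini-1.5-Flash | AVG | 50.7 |  | Unverified |
| 4 | Qwen2-VL (7B) | AVG | 49.7 |  | Unverified |
| 5 | LLaVA-OneVision (7B) | AVG | 49.5 |  | Unverified |
| 6 | InternVL2 (7B) | AVG | 48.7 |  | Unverified |
| 7 | InternVL2 (4B) | AVG | 44.1 |  | Unverified |
| 8 | LongVA (7B) | AVG | 43.6 |  | Unverified |
| 9 | LLaMA-VID (7B) | AVG | 41.9 |  | Unverified |
| 10 | MiniCPM-V 2.6 (7B) | AVG | 39.1 |  | Unverified |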

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GF (sup) - Faster RCNN | Average Accuracy | 55.08 |  | Unverified |
| 2 | MIST - CLIP | Average Accuracy | 54.39 |  | Unverified |
| 3 | GF (uns) - S3D | Average Accuracy | 53.33 |  | Unverified |
| 4 | SViTT | Average Accuracy | 52.7 |  | Unverified |
| 5 | MIST - AIO | Average Accuracy | 50.96 |  | Unverified |
| 6 | SHG-VQA (trained from scratch) | Average Accuracy | 49.2 |  | Unverified |
| 7 | AIO - ViT | Average Accuracy | 48.59 |  | Unverified |
| 8 | MMTF | Average Accuracy | 44.36 |  | Unverified |
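
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Text + Text (no Multimodal Pretext Training) | Accuracy | 93.2 |  | Unverified |
| 2 | FrozenBiLM | Accuracy | 86.7 |  | Unverified |
| 3 | Just Ask | Accuracy | 84.4 |  | Unverified |
| 4 | SeViLA | Accuracy | 83.7 |  | Unverified |
| 5 | Hero w/ pre-training | Accuracy | 77.75 |  | Unverified |
| 6 | ATP | Accuracy | 65.1 |  | Unverified |
| 7 | FrozenBiLM (0-shot) | Accuracy | 58.4 |  | Unverified |
| 8 | Just Ask (0-shot) | Accuracy | 51.1 |  | Unverified |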