SOTAVerified

Video Question Answering

Papers

Showing 401–450 of 460 papers

Title | Status | Hype
iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability | | 0
Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering | Code | 0
Learning to Rehearse in Long Sequence Memorization | | 0
Relation-aware Hierarchical Attention Framework for Video Question Answering | Code | 0
Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering | | 0
Object-Centric Representation Learning for Video Question Answering | | 0
FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework | Code | 0
Video Question Answering with Phrases via Semantic Roles | | 0
CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning | | 0
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning | | 0
HySTER: A Hybrid Spatio-Temporal Event Reasoner | | 0
Recent Advances in Video Question Answering: A Review of Datasets and Methods | | 0
End-to-End Video Question-Answer Generation with Generator-Pretester Network | Code | 0
HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering | | 0
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments | | 0
Video Question Answering Using Language-Guided Deep Compressed-Domain Video Feature | | 0
Trying Bilinear Pooling in Video-QA | | 0
On Modality Bias in the TVQA Dataset | Code | 0
Open-Ended Multi-Modal Relational Reasoning for Video Question Answering | Code | 0
iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | | 0
ActBERT: Learning Global-Local Video-Text Representations | Code | 0
Co-attentional Transformers for Story-Based Video Understanding | | 0
Hierarchical Conditional Relation Networks for Multimodal Video Question Answering | | 0
Self-supervised pre-training and contrastive representation learning for multiple-choice video QA | | 0
Data augmentation techniques for the Video Question Answering task | | 0
Video Question Answering on Screencast Tutorials | | 0
What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets | | 0
Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training | | 0
Modality Shifting Attention Network for Multi-modal Video Question Answering | | 0
DramaQA: Character-Centered Video Story Understanding with Hierarchical QA | Code | 0
Knowledge-Based Visual Question Answering in Videos | | 0
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | Code | 0
Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge | | 0
TutorialVQA: Question Answering Dataset for Tutorial Videos | Code | 0
Video Dialog via Progressive Inference and Cross-Transformer | | 0
KnowIT VQA: Answering Knowledge-Based Questions about Videos | | 0
A Better Way to Attend: Attention with Trees for Video Question Answering | Code | 0
Learning Question-Guided Video Representation for Multi-Turn Video Question Answering | | 0
OmniNet: A unified architecture for multi-modal multi-task learning | Code | 0
Neural Reasoning, Fast and Slow, for Video Question Answering | | 0
Video Question Generation via Cross-Modal Self-Attention Networks Learning | | 0
Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks | | 0
Adversarial Multimodal Network for Movie Question Answering | | 0
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | Code | 0
Gaining Extra Supervision via Multi-task learning for Multi-Modal Video Question Answering | | 0
TVQA+: Spatio-Temporal Grounding for Video Question Answering | Code | 0
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | Code | 0
Holistic Multi-modal Memory Network for Movie Question Answering | | 0
TVQA: Localized, Compositional Video Question Answering | Code | 0
A Joint Sequence Fusion Model for Video Question Answering and Retrieval | Code | 0
Page 9 of 10

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | LinVT-Qwen2-VL (7B) | Accuracy | 85.5 | | Unverified
2 | InternVL-2.5(8B) | Accuracy | 85.5 | | Unverified
3 | VideoLLaMA3(7B) | Accuracy | 84.5 | | Unverified
4 | PLM-8B | Accuracy | 84.1 | | Unverified
5 | BIMBA-LLaVA-Qwen2-7B | Accuracy | 83.73 | | Unverified
6 | PLM-3B | Accuracy | 83.4 | | Unverified
7 | LLaVA-Video | Accuracy | 83.2 | | Unverified
8 | NVILA(8B) | Accuracy | 82.2 | | Unverified
9 | Oryx-1.5(7B) | Accuracy | 81.8 | | Unverified
10 | Qwen2-VL(7B) | Accuracy | 81.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | Accuracy | 61.2 | | Unverified
2 | GPT-2 + CLIP-32 (Zero-Shot) | Accuracy | 58.4 | | Unverified
3 | VideoCoCa | Accuracy | 56.1 | | Unverified
4 | Mirasol3B | Accuracy | 51.13 | | Unverified
5 | VAST | Accuracy | 50.4 | | Unverified
6 | COSA | Accuracy | 49.9 | | Unverified
7 | MA-LMM | Accuracy | 49.8 | | Unverified
8 | VideoChat2 | Accuracy | 49.1 | | Unverified
9 | VALOR | Accuracy | 48.6 | | Unverified
10 | UMT-L (ViT-L/16) | Accuracy | 47.9 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Seed1.5-VL thinking | Average Accuracy | 63.6 | | Unverified
2 | PLM-8B | Average Accuracy | 63.5 | | Unverified
3 | Seed1.5-VL | Average Accuracy | 61.5 | | Unverified
4 | V-JEPA 2 ViT-g 8B | Average Accuracy | 60.6 | | Unverified
5 | PLM-3B | Average Accuracy | 58.9 | | Unverified
6 | RRPO | Average Accuracy | 56.5 | | Unverified
7 | Tarsier-34B | Average Accuracy | 55.5 | | Unverified
8 | Tarsier2-7B | Average Accuracy | 54.7 | | Unverified
9 | Qwen2-VL-72B | Average Accuracy | 52.7 | | Unverified
10 | IXC-2.5 7B | Average Accuracy | 51.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | LinVT-Qwen2-VL (7B) | Avg. | 69.3 | | Unverified
2 | Tarsier (34B) | Avg. | 67.6 | | Unverified
3 | InternVideo2 | Avg. | 67.2 | | Unverified
4 | LongVU (7B) | Avg. | 66.9 | | Unverified
5 | Oryx(34B) | Avg. | 64.7 | | Unverified
6 | VideoLLaMA2 (72B) | Avg. | 62 | | Unverified
7 | VideoChat-T (7B) | Avg. | 59.9 | | Unverified
8 | mPLUG-Owl3(7B) | Avg. | 59.5 | | Unverified
9 | PPLLaVA (7b) | Avg. | 59.2 | | Unverified
10 | VideoGPT+ | Avg. | 58.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Mirasol3B | Accuracy | 50.42 | | Unverified
2 | VAST | Accuracy | 50.1 | | Unverified
3 | COSA | Accuracy | 49.2 | | Unverified
4 | VALOR | Accuracy | 49.2 | | Unverified
5 | MA-LMM | Accuracy | 48.5 | | Unverified
6 | mPLUG-2 | Accuracy | 48 | | Unverified
7 | FrozenBiLM | Accuracy | 47 | | Unverified
8 | HBI | Accuracy | 46.2 | | Unverified
9 | EMCL-Net | Accuracy | 45.8 | | Unverified
10 | VindLU | Accuracy | 44.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VLAP (4 frames) | Average Accuracy | 67.1 | | Unverified
2 | LLaMA-VQA | Average Accuracy | 65.4 | | Unverified
3 | SeViLA | Average Accuracy | 64.9 | | Unverified
4 | InternVideo | Average Accuracy | 58.7 | | Unverified
5 | GF(sup) | Average Accuracy | 53.94 | | Unverified
6 | GF(uns) | Average Accuracy | 53.86 | | Unverified
7 | MIST | Average Accuracy | 51.13 | | Unverified
8 | Temp[ATP] | Average Accuracy | 48.37 | | Unverified
9 | AnyMAL-70B (0-shot) | Average Accuracy | 48.2 | | Unverified
10 | All-in-one | Average Accuracy | 47.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Seed1.5-VL | AVG | 60 | | Unverified
2 | VideoChat-Online (4B) | AVG | 54.9 | | Unverified
3 | Gemini-1.5-Flash | AVG | 50.7 | | Unverified
4 | Qwen2-VL (7B) | AVG | 49.7 | | Unverified
5 | LLaVA-OneVision (7B) | AVG | 49.5 | | Unverified
6 | InternVL2 (7B) | AVG | 48.7 | | Unverified
7 | InternVL2 (4B) | AVG | 44.1 | | Unverified
8 | LongVA (7B) | AVG | 43.6 | | Unverified
9 | LLaMA-VID (7B) | AVG | 41.9 | | Unverified
10 | MiniCPM-V 2.6 (7B) | AVG | 39.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GF (sup) - Faster RCNN | Average Accuracy | 55.08 | | Unverified
2 | MIST - CLIP | Average Accuracy | 54.39 | | Unverified
3 | GF (uns) - S3D | Average Accuracy | 53.33 | | Unverified
4 | SViTT | Average Accuracy | 52.7 | | Unverified
5 | MIST - AIO | Average Accuracy | 50.96 | | Unverified
6 | SHG-VQA (trained from scratch) | Average Accuracy | 49.2 | | Unverified
7 | AIO - ViT | Average Accuracy | 48.59 | | Unverified
8 | MMTF | Average Accuracy | 44.36 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Text + Text (no Multimodal Pretext Training) | Accuracy | 93.2 | | Unverified
2 | FrozenBiLM | Accuracy | 86.7 | | Unverified
3 | Just Ask | Accuracy | 84.4 | | Unverified
4 | SeViLA | Accuracy | 83.7 | | Unverified
5 | Hero w/ pre-training | Accuracy | 77.75 | | Unverified
6 | ATP | Accuracy | 65.1 | | Unverified
7 | FrozenBiLM (0-shot) | Accuracy | 58.4 | | Unverified
8 | Just Ask (0-shot) | Accuracy | 51.1 | | Unverified