SOTAVerified

Video Question Answering

Papers

Showing 151-200 of 460 papers

| Title | Status | Hype |
| --- | --- | --- |
| IntentQA: Context-aware Video Intent Reasoning | Code | 1 |
| Discovering Spatio-Temporal Rationales for Video Question Answering | Code | 1 |
| ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Code | 1 |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | Code | 1 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Code | 1 |
| Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA | Code | 1 |
| MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Code | 1 |
| AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering | Code | 1 |
| Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Code | 1 |
| DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Code | 1 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Code | 1 |
| Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | Code | 1 |
| Video Dialog as Conversation about Objects Living in Space-Time | Code | 1 |
| Video Graph Transformer for Video Question Answering | Code | 1 |
| Video-Language Alignment via Spatio-Temporal Graph Transformer | Code | 1 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Code | 1 |
| Hierarchical Conditional Relation Networks for Video Question Answering | Code | 1 |
| Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | Code | 1 |
| DAM: Dynamic Adapter Merging for Continual Video QA Learning | Code | 1 |
| Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | Code | 1 |
| HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | Code | 1 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Code | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1 |
| On the hidden treasure of dialog in video question answering | Code | 1 |
| HawkEye: Training Video-Text LLMs for Grounding Text in Videos | Code | 1 |
| Agentic Keyframe Search for Video Question Answering | Code | 1 |
| VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation | Code | 1 |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Code | 1 |
| AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant | Code | 1 |
| Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Code | 1 |
| Grounded Question-Answering in Long Egocentric Videos | Code | 1 |
| Paxion: Patching Action Knowledge in Video-Language Foundation Models | Code | 1 |
| Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering | Code | 1 |
| Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | Code | 1 |
| Encoding and Controlling Global Semantics for Long-form Video Question Answering | Code | 1 |
| Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | Code | 1 |
| CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions | Code | 1 |
| Location-aware Graph Convolutional Networks for Video Question Answering | Code | 1 |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | Code | 1 |
| VideoCon: Robust Video-Language Alignment via Contrast Captions | Code | 1 |
| Visual Commonsense-aware Representation Network for Video Captioning | Code | 1 |
| FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos | Code | 0 |
| TutorialVQA: Question Answering Dataset for Tutorial Videos | Code | 0 |
| TVQA: Localized, Compositional Video Question Answering | Code | 0 |
| ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos | Code | 0 |
| FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework | Code | 0 |
| Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval | Code | 0 |
| TVQA+: Spatio-Temporal Grounding for Video Question Answering | Code | 0 |
| Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks | Code | 0 |
| Extending Compositional Attention Networks for Social Reasoning in Videos | Code | 0 |
Page 4 of 10

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | LinVT-Qwen2-VL (7B) | Accuracy | 85.5 | — | Unverified |
| 2 | InternVL-2.5 (8B) | Accuracy | 85.5 | — | Unverified |
| 3 | VideoLLaMA3 (7B) | Accuracy | 84.5 | — | Unverified |
| 4 | PLM-8B | Accuracy | 84.1 | — | Unverified |
| 5 | BIMBA-LLaVA-Qwen2-7B | Accuracy | 83.73 | — | Unverified |
| 6 | PLM-3B | Accuracy | 83.4 | — | Unverified |
| 7 | LLaVA-Video | Accuracy | 83.2 | — | Unverified |
| 8 | NVILA (8B) | Accuracy | 82.2 | — | Unverified |
| 9 | Oryx-1.5 (7B) | Accuracy | 81.8 | — | Unverified |
| 10 | Qwen2-VL (7B) | Accuracy | 81.2 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | Accuracy | 61.2 | — | Unverified |
| 2 | GPT-2 + CLIP-32 (Zero-Shot) | Accuracy | 58.4 | — | Unverified |
| 3 | VideoCoCa | Accuracy | 56.1 | — | Unverified |
| 4 | Mirasol3B | Accuracy | 51.13 | — | Unverified |
| 5 | VAST | Accuracy | 50.4 | — | Unverified |
| 6 | COSA | Accuracy | 49.9 | — | Unverified |
| 7 | MA-LMM | Accuracy | 49.8 | — | Unverified |
| 8 | VideoChat2 | Accuracy | 49.1 | — | Unverified |
| 9 | VALOR | Accuracy | 48.6 | — | Unverified |
| 10 | UMT-L (ViT-L/16) | Accuracy | 47.9 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Seed1.5-VL thinking | Average Accuracy | 63.6 | — | Unverified |
| 2 | PLM-8B | Average Accuracy | 63.5 | — | Unverified |
| 3 | Seed1.5-VL | Average Accuracy | 61.5 | — | Unverified |
| 4 | V-JEPA 2 ViT-g 8B | Average Accuracy | 60.6 | — | Unverified |
| 5 | PLM-3B | Average Accuracy | 58.9 | — | Unverified |
| 6 | RRPO | Average Accuracy | 56.5 | — | Unverified |
| 7 | Tarsier-34B | Average Accuracy | 55.5 | — | Unverified |
| 8 | Tarsier2-7B | Average Accuracy | 54.7 | — | Unverified |
| 9 | Qwen2-VL-72B | Average Accuracy | 52.7 | — | Unverified |
| 10 | IXC-2.5 7B | Average Accuracy | 51.6 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | LinVT-Qwen2-VL (7B) | Avg. | 69.3 | — | Unverified |
| 2 | Tarsier (34B) | Avg. | 67.6 | — | Unverified |
| 3 | InternVideo2 | Avg. | 67.2 | — | Unverified |
| 4 | LongVU (7B) | Avg. | 66.9 | — | Unverified |
| 5 | Oryx (34B) | Avg. | 64.7 | — | Unverified |
| 6 | VideoLLaMA2 (72B) | Avg. | 62 | — | Unverified |
| 7 | VideoChat-T (7B) | Avg. | 59.9 | — | Unverified |
| 8 | mPLUG-Owl3 (7B) | Avg. | 59.5 | — | Unverified |
| 9 | PPLLaVA (7B) | Avg. | 59.2 | — | Unverified |
| 10 | VideoGPT+ | Avg. | 58.7 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Mirasol3B | Accuracy | 50.42 | — | Unverified |
| 2 | VAST | Accuracy | 50.1 | — | Unverified |
| 3 | COSA | Accuracy | 49.2 | — | Unverified |
| 4 | VALOR | Accuracy | 49.2 | — | Unverified |
| 5 | MA-LMM | Accuracy | 48.5 | — | Unverified |
| 6 | mPLUG-2 | Accuracy | 48 | — | Unverified |
| 7 | FrozenBiLM | Accuracy | 47 | — | Unverified |
| 8 | HBI | Accuracy | 46.2 | — | Unverified |
| 9 | EMCL-Net | Accuracy | 45.8 | — | Unverified |
| 10 | VindLU | Accuracy | 44.6 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | VLAP (4 frames) | Average Accuracy | 67.1 | — | Unverified |
| 2 | LLaMA-VQA | Average Accuracy | 65.4 | — | Unverified |
| 3 | SeViLA | Average Accuracy | 64.9 | — | Unverified |
| 4 | InternVideo | Average Accuracy | 58.7 | — | Unverified |
| 5 | GF (sup) | Average Accuracy | 53.94 | — | Unverified |
| 6 | GF (uns) | Average Accuracy | 53.86 | — | Unverified |
| 7 | MIST | Average Accuracy | 51.13 | — | Unverified |
| 8 | Temp[ATP] | Average Accuracy | 48.37 | — | Unverified |
| 9 | AnyMAL-70B (0-shot) | Average Accuracy | 48.2 | — | Unverified |
| 10 | All-in-one | Average Accuracy | 47.5 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Seed1.5-VL | AVG | 60 | — | Unverified |
| 2 | VideoChat-Online (4B) | AVG | 54.9 | — | Unverified |
| 3 | Gemini-1.5-Flash | AVG | 50.7 | — | Unverified |
| 4 | Qwen2-VL (7B) | AVG | 49.7 | — | Unverified |
| 5 | LLaVA-OneVision (7B) | AVG | 49.5 | — | Unverified |
| 6 | InternVL2 (7B) | AVG | 48.7 | — | Unverified |
| 7 | InternVL2 (4B) | AVG | 44.1 | — | Unverified |
| 8 | LongVA (7B) | AVG | 43.6 | — | Unverified |
| 9 | LLaMA-VID (7B) | AVG | 41.9 | — | Unverified |
| 10 | MiniCPM-V 2.6 (7B) | AVG | 39.1 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GF (sup) - Faster RCNN | Average Accuracy | 55.08 | — | Unverified |
| 2 | MIST - CLIP | Average Accuracy | 54.39 | — | Unverified |
| 3 | GF (uns) - S3D | Average Accuracy | 53.33 | — | Unverified |
| 4 | SViTT | Average Accuracy | 52.7 | — | Unverified |
| 5 | MIST - AIO | Average Accuracy | 50.96 | — | Unverified |
| 6 | SHG-VQA (trained from scratch) | Average Accuracy | 49.2 | — | Unverified |
| 7 | AIO - ViT | Average Accuracy | 48.59 | — | Unverified |
| 8 | MMTF | Average Accuracy | 44.36 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Text + Text (no Multimodal Pretext Training) | Accuracy | 93.2 | — | Unverified |
| 2 | FrozenBiLM | Accuracy | 86.7 | — | Unverified |
| 3 | Just Ask | Accuracy | 84.4 | — | Unverified |
| 4 | SeViLA | Accuracy | 83.7 | — | Unverified |
| 5 | Hero w/ pre-training | Accuracy | 77.75 | — | Unverified |
| 6 | ATP | Accuracy | 65.1 | — | Unverified |
| 7 | FrozenBiLM (0-shot) | Accuracy | 58.4 | — | Unverified |
| 8 | Just Ask (0-shot) | Accuracy | 51.1 | — | Unverified |