SOTAVerified

Video Question Answering

Papers

Showing 101125 of 460 papers

TitleStatusHype
Grounded Multi-Hop VideoQA in Long-Form Egocentric VideosCode1
Learning Video Context as Interleaved Multimodal SequencesCode1
Video-Language Alignment via Spatio-Temporal Graph TransformerCode1
Referring Atomic Video Action RecognitionCode1
HCQA @ Ego4D EgoSchema Challenge 2024Code1
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video UnderstandingCode1
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QACode1
Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question AnsweringCode1
Encoding and Controlling Global Semantics for Long-form Video Question AnsweringCode1
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-AnsweringCode1
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual ScenesCode1
HawkEye: Training Video-Text LLMs for Grounding Text in VideosCode1
DAM: Dynamic Adapter Merging for Continual Video QA LearningCode1
Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding BridgeCode1
Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question AnsweringCode1
Glance and Focus: Memory Prompting for Multi-Event Video Question AnsweringCode1
Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional SportsCode1
A Simple LLM Framework for Long-Range Video Question-AnsweringCode1
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot VideosCode1
ViLA: Efficient Video-Language Alignment for Video Question AnsweringCode1
Grounded Question-Answering in Long Egocentric VideosCode1
RTQ: Rethinking Video-language Understanding Based on Image-text ModelCode1
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question AnsweringCode1
VideoCon: Robust Video-Language Alignment via Contrast CaptionsCode1
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language UnderstandingCode1
Show:102550
← PrevPage 5 of 19Next →

Benchmark Results

#ModelMetricClaimedVerifiedStatus
1InternVL-2.5(8B)Accuracy85.5Unverified
2LinVT-Qwen2-VL (7B)Accuracy85.5Unverified
3VideoLLaMA3(7B)Accuracy84.5Unverified
4PLM-8BAccuracy84.1Unverified
5BIMBA-LLaVA-Qwen2-7BAccuracy83.73Unverified
6PLM-3BAccuracy83.4Unverified
7LLaVA-VideoAccuracy83.2Unverified
8NVILA(8B)Accuracy82.2Unverified
9Oryx-1.5(7B)Accuracy81.8Unverified
10Qwen2-VL(7B)Accuracy81.2Unverified
#ModelMetricClaimedVerifiedStatus
1GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot)Accuracy61.2Unverified
2GPT-2 + CLIP-32 (Zero-Shot)Accuracy58.4Unverified
3VideoCoCaAccuracy56.1Unverified
4Mirasol3BAccuracy51.13Unverified
5VASTAccuracy50.4Unverified
6COSAAccuracy49.9Unverified
7MA-LMMAccuracy49.8Unverified
8VideoChat2Accuracy49.1Unverified
9VALORAccuracy48.6Unverified
10UMT-L (ViT-L/16)Accuracy47.9Unverified
#ModelMetricClaimedVerifiedStatus
1Seed1.5-VL thinkingAverage Accuracy63.6Unverified
2PLM-8BAverage Accuracy63.5Unverified
3Seed1.5-VLAverage Accuracy61.5Unverified
4V-JEPA 2 ViT-g 8BAverage Accuracy60.6Unverified
5PLM-3BAverage Accuracy58.9Unverified
6RRPOAverage Accuracy56.5Unverified
7Tarsier-34BAverage Accuracy55.5Unverified
8Tarsier2-7BAverage Accuracy54.7Unverified
9Qwen2-VL-72BAverage Accuracy52.7Unverified
10IXC-2.5 7BAverage Accuracy51.6Unverified
#ModelMetricClaimedVerifiedStatus
1LinVT-Qwen2-VL (7B)Avg.69.3Unverified
2Tarsier (34B)Avg.67.6Unverified
3InternVideo2Avg.67.2Unverified
4LongVU (7B)Avg.66.9Unverified
5Oryx(34B)Avg.64.7Unverified
6VideoLLaMA2 (72B)Avg.62Unverified
7VideoChat-T (7B)Avg.59.9Unverified
8mPLUG-Owl3(7B)Avg.59.5Unverified
9PPLLaVA (7b)Avg.59.2Unverified
10VideoGPT+Avg.58.7Unverified
#ModelMetricClaimedVerifiedStatus
1Mirasol3BAccuracy50.42Unverified
2VASTAccuracy50.1Unverified
3COSAAccuracy49.2Unverified
4VALORAccuracy49.2Unverified
5MA-LMMAccuracy48.5Unverified
6mPLUG-2Accuracy48Unverified
7FrozenBiLMAccuracy47Unverified
8HBIAccuracy46.2Unverified
9EMCL-NetAccuracy45.8Unverified
10VindLUAccuracy44.6Unverified
#ModelMetricClaimedVerifiedStatus
1VLAP (4 frames)Average Accuracy67.1Unverified
2LLaMA-VQAAverage Accuracy65.4Unverified
3SeViLAAverage Accuracy64.9Unverified
4InternVideoAverage Accuracy58.7Unverified
5GF(sup)Average Accuracy53.94Unverified
6GF(uns)Average Accuracy53.86Unverified
7MISTAverage Accuracy51.13Unverified
8Temp[ATP]Average Accuracy48.37Unverified
9AnyMAL-70B (0-shot)Average Accuracy48.2Unverified
10All-in-oneAverage Accuracy47.5Unverified
#ModelMetricClaimedVerifiedStatus
1Seed1.5-VLAVG60Unverified
2VideoChat-Online (4B)AVG54.9Unverified
3Gemini-1.5-FlashAVG50.7Unverified
4Qwen2-VL (7B)AVG49.7Unverified
5LLaVA-OneVision (7B)AVG49.5Unverified
6InternVL2 (7B)AVG48.7Unverified
7InternVL2 (4B)AVG44.1Unverified
8LongVA (7B)AVG43.6Unverified
9LLaMA-VID (7B)AVG41.9Unverified
10MiniCPM-V 2.6 (7B)AVG39.1Unverified
#ModelMetricClaimedVerifiedStatus
1GF (sup) - Faster RCNNAverage Accuracy55.08Unverified
2MIST - CLIPAverage Accuracy54.39Unverified
3GF (uns) - S3DAverage Accuracy53.33Unverified
4SViTTAverage Accuracy52.7Unverified
5MIST - AIOAverage Accuracy50.96Unverified
6SHG-VQA (trained from scratch)Average Accuracy49.2Unverified
7AIO - ViTAverage Accuracy48.59Unverified
8MMTFAverage Accuracy44.36Unverified
#ModelMetricClaimedVerifiedStatus
1Text + Text (no Multimodal Pretext Training)Accuracy93.2Unverified
2FrozenBiLMAccuracy86.7Unverified
3Just AskAccuracy84.4Unverified
4SeViLAAccuracy83.7Unverified
5Hero w/ pre-trainingAccuracy77.75Unverified
6ATPAccuracy65.1Unverified
7FrozenBiLM (0-shot)Accuracy58.4Unverified
8Just Ask (0-shot)Accuracy51.1Unverified