SOTAVerified

Video Question Answering

Papers

Showing 101–150 of 460 papers

Title | Status | Hype
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Code | 1
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | Code | 1
Clover: Towards A Unified Video-Language Alignment and Fusion Model | Code | 1
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Code | 1
Equivariant and Invariant Grounding for Video Question Answering | Code | 1
Learning Video Context as Interleaved Multimodal Sequences | Code | 1
Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | Code | 1
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Code | 1
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Code | 1
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Code | 1
SViTT: Temporal Learning of Sparse Video-Text Transformers | Code | 1
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation | Code | 1
Scene-Text Grounding for Text-Based Video Question Answering | Code | 1
Encoding and Controlling Global Semantics for Long-form Video Question Answering | Code | 1
Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting | Code | 1
Self-Chained Image-Language Model for Video Localization and Question Answering | Code | 1
Connecting Vision and Language with Video Localized Narratives | Code | 1
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Code | 1
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes | Code | 1
RTQ: Rethinking Video-language Understanding Based on Image-text Model | Code | 1
EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos | Code | 1
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | Code | 1
Contrastive Video Question Answering via Video Graph Transformer | Code | 1
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | Code | 1
FunQA: Towards Surprising Video Comprehension | Code | 1
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | Code | 1
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Code | 1
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1
Can I Trust Your Answer? Visually Grounded Video Question Answering | Code | 1
A Comprehensive Review of the Video-to-Text Problem | Code | 1
ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark | Code | 1
Revisiting the "Video" in Video-Language Understanding | Code | 1
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Code | 1
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | Code | 1
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | Code | 1
Location-aware Graph Convolutional Networks for Video Question Answering | Code | 1
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval | Code | 1
LifeQA: A Real-life Dataset for Video Question Answering | Code | 1
Discovering Spatio-Temporal Rationales for Video Question Answering | Code | 1
Referring Atomic Video Action Recognition | Code | 1
Learning to Answer Visual Questions from Web Videos | Code | 1
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering | Code | 1
Learning Situation Hyper-Graphs for Video Question Answering | Code | 1
Large Language Models are Temporal and Causal Reasoners for Video Question Answering | Code | 1
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Code | 1
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Code | 1
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | Code | 1
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Code | 1
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA | Code | 1
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Code | 1
Page 3 of 10

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | LinVT-Qwen2-VL (7B) | Accuracy | 85.5 | – | Unverified
2 | InternVL-2.5 (8B) | Accuracy | 85.5 | – | Unverified
3 | VideoLLaMA3 (7B) | Accuracy | 84.5 | – | Unverified
4 | PLM-8B | Accuracy | 84.1 | – | Unverified
5 | BIMBA-LLaVA-Qwen2-7B | Accuracy | 83.73 | – | Unverified
6 | PLM-3B | Accuracy | 83.4 | – | Unverified
7 | LLaVA-Video | Accuracy | 83.2 | – | Unverified
8 | NVILA (8B) | Accuracy | 82.2 | – | Unverified
9 | Oryx-1.5 (7B) | Accuracy | 81.8 | – | Unverified
10 | Qwen2-VL (7B) | Accuracy | 81.2 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | Accuracy | 61.2 | – | Unverified
2 | GPT-2 + CLIP-32 (Zero-Shot) | Accuracy | 58.4 | – | Unverified
3 | VideoCoCa | Accuracy | 56.1 | – | Unverified
4 | Mirasol3B | Accuracy | 51.13 | – | Unverified
5 | VAST | Accuracy | 50.4 | – | Unverified
6 | COSA | Accuracy | 49.9 | – | Unverified
7 | MA-LMM | Accuracy | 49.8 | – | Unverified
8 | VideoChat2 | Accuracy | 49.1 | – | Unverified
9 | VALOR | Accuracy | 48.6 | – | Unverified
10 | UMT-L (ViT-L/16) | Accuracy | 47.9 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Seed1.5-VL thinking | Average Accuracy | 63.6 | – | Unverified
2 | PLM-8B | Average Accuracy | 63.5 | – | Unverified
3 | Seed1.5-VL | Average Accuracy | 61.5 | – | Unverified
4 | V-JEPA 2 ViT-g 8B | Average Accuracy | 60.6 | – | Unverified
5 | PLM-3B | Average Accuracy | 58.9 | – | Unverified
6 | RRPO | Average Accuracy | 56.5 | – | Unverified
7 | Tarsier-34B | Average Accuracy | 55.5 | – | Unverified
8 | Tarsier2-7B | Average Accuracy | 54.7 | – | Unverified
9 | Qwen2-VL-72B | Average Accuracy | 52.7 | – | Unverified
10 | IXC-2.5 7B | Average Accuracy | 51.6 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | LinVT-Qwen2-VL (7B) | Avg. | 69.3 | – | Unverified
2 | Tarsier (34B) | Avg. | 67.6 | – | Unverified
3 | InternVideo2 | Avg. | 67.2 | – | Unverified
4 | LongVU (7B) | Avg. | 66.9 | – | Unverified
5 | Oryx (34B) | Avg. | 64.7 | – | Unverified
6 | VideoLLaMA2 (72B) | Avg. | 62 | – | Unverified
7 | VideoChat-T (7B) | Avg. | 59.9 | – | Unverified
8 | mPLUG-Owl3 (7B) | Avg. | 59.5 | – | Unverified
9 | PPLLaVA (7B) | Avg. | 59.2 | – | Unverified
10 | VideoGPT+ | Avg. | 58.7 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Mirasol3B | Accuracy | 50.42 | – | Unverified
2 | VAST | Accuracy | 50.1 | – | Unverified
3 | COSA | Accuracy | 49.2 | – | Unverified
4 | VALOR | Accuracy | 49.2 | – | Unverified
5 | MA-LMM | Accuracy | 48.5 | – | Unverified
6 | mPLUG-2 | Accuracy | 48 | – | Unverified
7 | FrozenBiLM | Accuracy | 47 | – | Unverified
8 | HBI | Accuracy | 46.2 | – | Unverified
9 | EMCL-Net | Accuracy | 45.8 | – | Unverified
10 | VindLU | Accuracy | 44.6 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VLAP (4 frames) | Average Accuracy | 67.1 | – | Unverified
2 | LLaMA-VQA | Average Accuracy | 65.4 | – | Unverified
3 | SeViLA | Average Accuracy | 64.9 | – | Unverified
4 | InternVideo | Average Accuracy | 58.7 | – | Unverified
5 | GF (sup) | Average Accuracy | 53.94 | – | Unverified
6 | GF (uns) | Average Accuracy | 53.86 | – | Unverified
7 | MIST | Average Accuracy | 51.13 | – | Unverified
8 | Temp[ATP] | Average Accuracy | 48.37 | – | Unverified
9 | AnyMAL-70B (0-shot) | Average Accuracy | 48.2 | – | Unverified
10 | All-in-one | Average Accuracy | 47.5 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Seed1.5-VL | AVG | 60 | – | Unverified
2 | VideoChat-Online (4B) | AVG | 54.9 | – | Unverified
3 | Gemini-1.5-Flash | AVG | 50.7 | – | Unverified
4 | Qwen2-VL (7B) | AVG | 49.7 | – | Unverified
5 | LLaVA-OneVision (7B) | AVG | 49.5 | – | Unverified
6 | InternVL2 (7B) | AVG | 48.7 | – | Unverified
7 | InternVL2 (4B) | AVG | 44.1 | – | Unverified
8 | LongVA (7B) | AVG | 43.6 | – | Unverified
9 | LLaMA-VID (7B) | AVG | 41.9 | – | Unverified
10 | MiniCPM-V 2.6 (7B) | AVG | 39.1 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GF (sup) - Faster RCNN | Average Accuracy | 55.08 | – | Unverified
2 | MIST - CLIP | Average Accuracy | 54.39 | – | Unverified
3 | GF (uns) - S3D | Average Accuracy | 53.33 | – | Unverified
4 | SViTT | Average Accuracy | 52.7 | – | Unverified
5 | MIST - AIO | Average Accuracy | 50.96 | – | Unverified
6 | SHG-VQA (trained from scratch) | Average Accuracy | 49.2 | – | Unverified
7 | AIO - ViT | Average Accuracy | 48.59 | – | Unverified
8 | MMTF | Average Accuracy | 44.36 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Text + Text (no Multimodal Pretext Training) | Accuracy | 93.2 | – | Unverified
2 | FrozenBiLM | Accuracy | 86.7 | – | Unverified
3 | Just Ask | Accuracy | 84.4 | – | Unverified
4 | SeViLA | Accuracy | 83.7 | – | Unverified
5 | Hero w/ pre-training | Accuracy | 77.75 | – | Unverified
6 | ATP | Accuracy | 65.1 | – | Unverified
7 | FrozenBiLM (0-shot) | Accuracy | 58.4 | – | Unverified
8 | Just Ask (0-shot) | Accuracy | 51.1 | – | Unverified