SOTAVerified

Video Question Answering

Papers

Showing 101–150 of 460 papers

Title | Status | Hype
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Code | 1
Revisiting the "Video" in Video-Language Understanding | Code | 1
Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Code | 1
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | Code | 1
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval | Code | 1
Clover: Towards A Unified Video-Language Alignment and Fusion Model | Code | 1
RTQ: Rethinking Video-language Understanding Based on Image-text Model | Code | 1
Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models | Code | 1
Equivariant and Invariant Grounding for Video Question Answering | Code | 1
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Code | 1
Learning Video Context as Interleaved Multimodal Sequences | Code | 1
Referring Atomic Video Action Recognition | Code | 1
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Code | 1
Paxion: Patching Action Knowledge in Video-Language Foundation Models | Code | 1
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation | Code | 1
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | Code | 1
Connecting Vision and Language with Video Localized Narratives | Code | 1
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer | Code | 1
Encoding and Controlling Global Semantics for Long-form Video Question Answering | Code | 1
Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting | Code | 1
On the hidden treasure of dialog in video question answering | Code | 1
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes | Code | 1
Contrastive Video Question Answering via Video Graph Transformer | Code | 1
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | Code | 1
FunQA: Towards Surprising Video Comprehension | Code | 1
EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos | Code | 1
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | Code | 1
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos | Code | 1
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Code | 1
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1
Can I Trust Your Answer? Visually Grounded Video Question Answering | Code | 1
A Comprehensive Review of the Video-to-Text Problem | Code | 1
ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark | Code | 1
Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering | Code | 1
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Code | 1
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Code | 1
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | Code | 1
Invariant Grounding for Video Question Answering | Code | 1
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | Code | 1
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Code | 1
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | Code | 1
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Code | 1
Discovering Spatio-Temporal Rationales for Video Question Answering | Code | 1
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA | Code | 1
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering | Code | 1
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Code | 1
Location-aware Graph Convolutional Networks for Video Question Answering | Code | 1
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Code | 1
Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | Code | 1
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | InternVL-2.5(8B) | Accuracy | 85.5 | | Unverified
2 | LinVT-Qwen2-VL (7B) | Accuracy | 85.5 | | Unverified
3 | VideoLLaMA3(7B) | Accuracy | 84.5 | | Unverified
4 | PLM-8B | Accuracy | 84.1 | | Unverified
5 | BIMBA-LLaVA-Qwen2-7B | Accuracy | 83.73 | | Unverified
6 | PLM-3B | Accuracy | 83.4 | | Unverified
7 | LLaVA-Video | Accuracy | 83.2 | | Unverified
8 | NVILA(8B) | Accuracy | 82.2 | | Unverified
9 | Oryx-1.5(7B) | Accuracy | 81.8 | | Unverified
10 | Qwen2-VL(7B) | Accuracy | 81.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | Accuracy | 61.2 | | Unverified
2 | GPT-2 + CLIP-32 (Zero-Shot) | Accuracy | 58.4 | | Unverified
3 | VideoCoCa | Accuracy | 56.1 | | Unverified
4 | Mirasol3B | Accuracy | 51.13 | | Unverified
5 | VAST | Accuracy | 50.4 | | Unverified
6 | COSA | Accuracy | 49.9 | | Unverified
7 | MA-LMM | Accuracy | 49.8 | | Unverified
8 | VideoChat2 | Accuracy | 49.1 | | Unverified
9 | VALOR | Accuracy | 48.6 | | Unverified
10 | UMT-L (ViT-L/16) | Accuracy | 47.9 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Seed1.5-VL thinking | Average Accuracy | 63.6 | | Unverified
2 | PLM-8B | Average Accuracy | 63.5 | | Unverified
3 | Seed1.5-VL | Average Accuracy | 61.5 | | Unverified
4 | V-JEPA 2 ViT-g 8B | Average Accuracy | 60.6 | | Unverified
5 | PLM-3B | Average Accuracy | 58.9 | | Unverified
6 | RRPO | Average Accuracy | 56.5 | | Unverified
7 | Tarsier-34B | Average Accuracy | 55.5 | | Unverified
8 | Tarsier2-7B | Average Accuracy | 54.7 | | Unverified
9 | Qwen2-VL-72B | Average Accuracy | 52.7 | | Unverified
10 | IXC-2.5 7B | Average Accuracy | 51.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | LinVT-Qwen2-VL (7B) | Avg. | 69.3 | | Unverified
2 | Tarsier (34B) | Avg. | 67.6 | | Unverified
3 | InternVideo2 | Avg. | 67.2 | | Unverified
4 | LongVU (7B) | Avg. | 66.9 | | Unverified
5 | Oryx(34B) | Avg. | 64.7 | | Unverified
6 | VideoLLaMA2 (72B) | Avg. | 62 | | Unverified
7 | VideoChat-T (7B) | Avg. | 59.9 | | Unverified
8 | mPLUG-Owl3(7B) | Avg. | 59.5 | | Unverified
9 | PPLLaVA (7b) | Avg. | 59.2 | | Unverified
10 | VideoGPT+ | Avg. | 58.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Mirasol3B | Accuracy | 50.42 | | Unverified
2 | VAST | Accuracy | 50.1 | | Unverified
3 | COSA | Accuracy | 49.2 | | Unverified
4 | VALOR | Accuracy | 49.2 | | Unverified
5 | MA-LMM | Accuracy | 48.5 | | Unverified
6 | mPLUG-2 | Accuracy | 48 | | Unverified
7 | FrozenBiLM | Accuracy | 47 | | Unverified
8 | HBI | Accuracy | 46.2 | | Unverified
9 | EMCL-Net | Accuracy | 45.8 | | Unverified
10 | VindLU | Accuracy | 44.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VLAP (4 frames) | Average Accuracy | 67.1 | | Unverified
2 | LLaMA-VQA | Average Accuracy | 65.4 | | Unverified
3 | SeViLA | Average Accuracy | 64.9 | | Unverified
4 | InternVideo | Average Accuracy | 58.7 | | Unverified
5 | GF(sup) | Average Accuracy | 53.94 | | Unverified
6 | GF(uns) | Average Accuracy | 53.86 | | Unverified
7 | MIST | Average Accuracy | 51.13 | | Unverified
8 | Temp[ATP] | Average Accuracy | 48.37 | | Unverified
9 | AnyMAL-70B (0-shot) | Average Accuracy | 48.2 | | Unverified
10 | All-in-one | Average Accuracy | 47.5 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Seed1.5-VL | AVG | 60 | | Unverified
2 | VideoChat-Online (4B) | AVG | 54.9 | | Unverified
3 | Gemini-1.5-Flash | AVG | 50.7 | | Unverified
4 | Qwen2-VL (7B) | AVG | 49.7 | | Unverified
5 | LLaVA-OneVision (7B) | AVG | 49.5 | | Unverified
6 | InternVL2 (7B) | AVG | 48.7 | | Unverified
7 | InternVL2 (4B) | AVG | 44.1 | | Unverified
8 | LongVA (7B) | AVG | 43.6 | | Unverified
9 | LLaMA-VID (7B) | AVG | 41.9 | | Unverified
10 | MiniCPM-V 2.6 (7B) | AVG | 39.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GF (sup) - Faster RCNN | Average Accuracy | 55.08 | | Unverified
2 | MIST - CLIP | Average Accuracy | 54.39 | | Unverified
3 | GF (uns) - S3D | Average Accuracy | 53.33 | | Unverified
4 | SViTT | Average Accuracy | 52.7 | | Unverified
5 | MIST - AIO | Average Accuracy | 50.96 | | Unverified
6 | SHG-VQA (trained from scratch) | Average Accuracy | 49.2 | | Unverified
7 | AIO - ViT | Average Accuracy | 48.59 | | Unverified
8 | MMTF | Average Accuracy | 44.36 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Text + Text (no Multimodal Pretext Training) | Accuracy | 93.2 | | Unverified
2 | FrozenBiLM | Accuracy | 86.7 | | Unverified
3 | Just Ask | Accuracy | 84.4 | | Unverified
4 | SeViLA | Accuracy | 83.7 | | Unverified
5 | Hero w/ pre-training | Accuracy | 77.75 | | Unverified
6 | ATP | Accuracy | 65.1 | | Unverified
7 | FrozenBiLM (0-shot) | Accuracy | 58.4 | | Unverified
8 | Just Ask (0-shot) | Accuracy | 51.1 | | Unverified