SOTAVerified

Video Question Answering

Papers

Showing 51–100 of 460 papers

Title | Status | Hype
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI | Code | 2
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models | Code | 2
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models | Code | 2
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Code | 2
Elysium: Exploring Object-level Perception in Videos via MLLM | Code | 2
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | Code | 2
All in One: Exploring Unified Video-Language Pre-training | Code | 2
vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Code | 2
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection | Code | 2
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code | 2
Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy | Code | 2
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Code | 2
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | Code | 2
Task Me Anything | Code | 2
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Code | 2
FreeVA: Offline MLLM as Training-Free Video Assistant | Code | 2
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward | Code | 2
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | Code | 2
Revealing Single Frame Bias for Video-and-Language Learning | Code | 2
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs | Code | 2
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Code | 2
ST-LLM: Large Language Models Are Effective Temporal Learners | Code | 2
Perception Test: A Diagnostic Benchmark for Multimodal Video Models | Code | 2
LingoQA: Visual Question Answering for Autonomous Driving | Code | 2
LinVT: Empower Your Image-level Large Language Model to Understand Videos | Code | 2
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs | Code | 2
Perception Test: A Diagnostic Benchmark for Multimodal Models | Code | 2
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Code | 2
OmniVid: A Generative Framework for Universal Video Understanding | Code | 2
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion | Code | 2
Online Video Understanding: OVBench and VideoChat-Online | Code | 2
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Code | 2
LITA: Language Instructed Temporal-Localization Assistant | Code | 2
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Code | 1
Contrastive Video Question Answering via Video Graph Transformer | Code | 1
A Simple LLM Framework for Long-Range Video Question-Answering | Code | 1
Large Language Models are Temporal and Causal Reasoners for Video Question Answering | Code | 1
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Code | 1
Connecting Vision and Language with Video Localized Narratives | Code | 1
Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering | Code | 1
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | Code | 1
Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Code | 1
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | Code | 1
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | Code | 1
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling | Code | 1
Invariant Grounding for Video Question Answering | Code | 1
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Code | 1
Clover: Towards A Unified Video-Language Alignment and Fusion Model | Code | 1
Learning Video Context as Interleaved Multimodal Sequences | Code | 1
∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Code | 1
Page 2 of 10

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | LinVT-Qwen2-VL (7B) | Accuracy | 85.5 | – | Unverified
2 | InternVL-2.5 (8B) | Accuracy | 85.5 | – | Unverified
3 | VideoLLaMA3 (7B) | Accuracy | 84.5 | – | Unverified
4 | PLM-8B | Accuracy | 84.1 | – | Unverified
5 | BIMBA-LLaVA-Qwen2-7B | Accuracy | 83.73 | – | Unverified
6 | PLM-3B | Accuracy | 83.4 | – | Unverified
7 | LLaVA-Video | Accuracy | 83.2 | – | Unverified
8 | NVILA (8B) | Accuracy | 82.2 | – | Unverified
9 | Oryx-1.5 (7B) | Accuracy | 81.8 | – | Unverified
10 | Qwen2-VL (7B) | Accuracy | 81.2 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | Accuracy | 61.2 | – | Unverified
2 | GPT-2 + CLIP-32 (Zero-Shot) | Accuracy | 58.4 | – | Unverified
3 | VideoCoCa | Accuracy | 56.1 | – | Unverified
4 | Mirasol3B | Accuracy | 51.13 | – | Unverified
5 | VAST | Accuracy | 50.4 | – | Unverified
6 | COSA | Accuracy | 49.9 | – | Unverified
7 | MA-LMM | Accuracy | 49.8 | – | Unverified
8 | VideoChat2 | Accuracy | 49.1 | – | Unverified
9 | VALOR | Accuracy | 48.6 | – | Unverified
10 | UMT-L (ViT-L/16) | Accuracy | 47.9 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Seed1.5-VL thinking | Average Accuracy | 63.6 | – | Unverified
2 | PLM-8B | Average Accuracy | 63.5 | – | Unverified
3 | Seed1.5-VL | Average Accuracy | 61.5 | – | Unverified
4 | V-JEPA 2 ViT-g 8B | Average Accuracy | 60.6 | – | Unverified
5 | PLM-3B | Average Accuracy | 58.9 | – | Unverified
6 | RRPO | Average Accuracy | 56.5 | – | Unverified
7 | Tarsier-34B | Average Accuracy | 55.5 | – | Unverified
8 | Tarsier2-7B | Average Accuracy | 54.7 | – | Unverified
9 | Qwen2-VL-72B | Average Accuracy | 52.7 | – | Unverified
10 | IXC-2.5 7B | Average Accuracy | 51.6 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | LinVT-Qwen2-VL (7B) | Avg. | 69.3 | – | Unverified
2 | Tarsier (34B) | Avg. | 67.6 | – | Unverified
3 | InternVideo2 | Avg. | 67.2 | – | Unverified
4 | LongVU (7B) | Avg. | 66.9 | – | Unverified
5 | Oryx (34B) | Avg. | 64.7 | – | Unverified
6 | VideoLLaMA2 (72B) | Avg. | 62 | – | Unverified
7 | VideoChat-T (7B) | Avg. | 59.9 | – | Unverified
8 | mPLUG-Owl3 (7B) | Avg. | 59.5 | – | Unverified
9 | PPLLaVA (7B) | Avg. | 59.2 | – | Unverified
10 | VideoGPT+ | Avg. | 58.7 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Mirasol3B | Accuracy | 50.42 | – | Unverified
2 | VAST | Accuracy | 50.1 | – | Unverified
3 | COSA | Accuracy | 49.2 | – | Unverified
4 | VALOR | Accuracy | 49.2 | – | Unverified
5 | MA-LMM | Accuracy | 48.5 | – | Unverified
6 | mPLUG-2 | Accuracy | 48 | – | Unverified
7 | FrozenBiLM | Accuracy | 47 | – | Unverified
8 | HBI | Accuracy | 46.2 | – | Unverified
9 | EMCL-Net | Accuracy | 45.8 | – | Unverified
10 | VindLU | Accuracy | 44.6 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VLAP (4 frames) | Average Accuracy | 67.1 | – | Unverified
2 | LLaMA-VQA | Average Accuracy | 65.4 | – | Unverified
3 | SeViLA | Average Accuracy | 64.9 | – | Unverified
4 | InternVideo | Average Accuracy | 58.7 | – | Unverified
5 | GF (sup) | Average Accuracy | 53.94 | – | Unverified
6 | GF (uns) | Average Accuracy | 53.86 | – | Unverified
7 | MIST | Average Accuracy | 51.13 | – | Unverified
8 | Temp[ATP] | Average Accuracy | 48.37 | – | Unverified
9 | AnyMAL-70B (0-shot) | Average Accuracy | 48.2 | – | Unverified
10 | All-in-one | Average Accuracy | 47.5 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Seed1.5-VL | AVG | 60 | – | Unverified
2 | VideoChat-Online (4B) | AVG | 54.9 | – | Unverified
3 | Gemini-1.5-Flash | AVG | 50.7 | – | Unverified
4 | Qwen2-VL (7B) | AVG | 49.7 | – | Unverified
5 | LLaVA-OneVision (7B) | AVG | 49.5 | – | Unverified
6 | InternVL2 (7B) | AVG | 48.7 | – | Unverified
7 | InternVL2 (4B) | AVG | 44.1 | – | Unverified
8 | LongVA (7B) | AVG | 43.6 | – | Unverified
9 | LLaMA-VID (7B) | AVG | 41.9 | – | Unverified
10 | MiniCPM-V 2.6 (7B) | AVG | 39.1 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GF (sup) - Faster RCNN | Average Accuracy | 55.08 | – | Unverified
2 | MIST - CLIP | Average Accuracy | 54.39 | – | Unverified
3 | GF (uns) - S3D | Average Accuracy | 53.33 | – | Unverified
4 | SViTT | Average Accuracy | 52.7 | – | Unverified
5 | MIST - AIO | Average Accuracy | 50.96 | – | Unverified
6 | SHG-VQA (trained from scratch) | Average Accuracy | 49.2 | – | Unverified
7 | AIO - ViT | Average Accuracy | 48.59 | – | Unverified
8 | MMTF | Average Accuracy | 44.36 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Text + Text (no Multimodal Pretext Training) | Accuracy | 93.2 | – | Unverified
2 | FrozenBiLM | Accuracy | 86.7 | – | Unverified
3 | Just Ask | Accuracy | 84.4 | – | Unverified
4 | SeViLA | Accuracy | 83.7 | – | Unverified
5 | Hero w/ pre-training | Accuracy | 77.75 | – | Unverified
6 | ATP | Accuracy | 65.1 | – | Unverified
7 | FrozenBiLM (0-shot) | Accuracy | 58.4 | – | Unverified
8 | Just Ask (0-shot) | Accuracy | 51.1 | – | Unverified