SOTAVerified

Video Question Answering

Papers

Showing 401-450 of 460 papers

| Title | Status | Hype |
|---|---|---|
| Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering | | 0 |
| Leveraging Video Descriptions to Learn Video Question Answering | | 0 |
| VideoLLM-online: Online Video Large Language Model for Streaming Video | | 0 |
| EVQAScore: Efficient Video Question Answering Data Evaluation | | 0 |
| E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer | | 0 |
| Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment | | 0 |
| LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling | | 0 |
| LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | | 0 |
| LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering | | 0 |
| Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments | | 0 |
| ENTER: Event Based Interpretable Reasoning for VideoQA | | 0 |
| Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | | 0 |
| LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning | | 0 |
| LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs | | 0 |
| Locate before Answering: Answer Guided Question Localization for Video Question Answering | | 0 |
| Admitting Ignorance Helps the Video Question Answering Models to Answer | | 0 |
| Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding | | 0 |
| Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | | 0 |
| End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling | | 0 |
| Efficient Motion-Aware Video MLLM | | 0 |
| VUDG: A Dataset for Video Understanding Domain Generalization | | 0 |
| MarioQA: Answering Questions by Watching Gameplay Videos | | 0 |
| Measuring Compositional Consistency for Video Question Answering | | 0 |
| VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation | | 0 |
| VideoOrion: Tokenizing Object Dynamics in Videos | | 0 |
| Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering | | 0 |
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | | 0 |
| AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | | 0 |
| M-LLM Based Video Frame Selection for Efficient Video Understanding | | 0 |
| MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering | | 0 |
| Modality Alignment between Deep Representations for Effective Video-and-Language Learning | | 0 |
| Modality Shifting Attention Network for Multi-modal Video Question Answering | | 0 |
| Modeling Semantic Composition with Syntactic Hypergraph for Video Question Answering | | 0 |
| Modular Blended Attention Network for Video Question Answering | | 0 |
| MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | | 0 |
| Motion-Appearance Co-Memory Networks for Video Question Answering | | 0 |
| Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering | | 0 |
| Diversifying Joint Vision-Language Tokenization Learning | | 0 |
| Distraction-free Embeddings for Robust VQA | | 0 |
| Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents | | 0 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | | 0 |
| Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering | | 0 |
| Dense but Efficient VideoQA for Intricate Compositional Reasoning | | 0 |
| MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | | 0 |
| VideoPrism: A Foundational Visual Encoder for Video Understanding | | 0 |
| Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering | | 0 |
| Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge | | 0 |
| Multi-object event graph representation learning for Video Question Answering | | 0 |
| Multi-Scale Progressive Attention Network for Video Question Answering | | 0 |
| Data augmentation techniques for the Video Question Answering task | | 0 |

Benchmark Results

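| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LinVT-Qwen2-VL (7B) | Accuracy | 85.5 | | Unverified |
| 2 | InternVL-2.5(8B) | Accuracy | 85.5 | | Unverified |
| 3 | VideoLLaMA3(7B) | Accuracy | 84.5 | | Unverified |
| 4 | PLM-8B | Accuracy | 84.1 | | Unverified |
| 5 | BIMBA-LLaVA-Qwen2-7B | Accuracy | 83.73 | | Unverified |
| 6 | PLM-3B | Accuracy | 83.4 | | Unverified |
| 7 | LLaVA-Video | Accuracy | 83.2 | | Unverified |
| 8 | NVILA(8B) | Accuracy | 82.2 | | Unverified |
| 9 | Oryx-1.5(7B) | Accuracy | 81.8 | | Unverified |
| 10 | Qwen2-VL(7B) | Accuracy | 81.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | Accuracy | 61.2 | | Unverified |
| 2 | GPT-2 + CLIP-32 (Zero-Shot) | Accuracy | 58.4 | | Unverified |
| 3 | VideoCoCa | Accuracy | 56.1 | | Unverified |
| 4 | Mirasol3B | Accuracy | 51.13 | | Unverified |
| 5 | VAST | Accuracy | 50.4 | | Unverified |
| 6 | COSA | Accuracy | 49.9 | | Unverified |
| 7 | MA-LMM | Accuracy | 49.8 | | Unverified |
| 8 | VideoChat2 | Accuracy | 49.1 | | Unverified |
| 9 | VALOR | Accuracy | 48.6 | | Unverified |
| 10 | UMT-L (ViT-L/16) | Accuracy | 47.9 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Seed1.5-VL thinking | Average Accuracy | 63.6 | | Unverified |
| 2 | PLM-8B | Average Accuracy | 63.5 | | Unverified |
| 3 | Seed1.5-VL | Average Accuracy | 61.5 | | Unverified |
| 4 | V-JEPA 2 ViT-g 8B | Average Accuracy | 60.6 | | Unverified |
| 5 | PLM-3B | Average Accuracy | 58.9 | | Unverified |
| 6 | RRPO | Average Accuracy | 56.5 | | Unverified |
| 7 | Tarsier-34B | Average Accuracy | 55.5 | | Unverified |
| 8 | Tarsier2-7B | Average Accuracy | 54.7 | | Unverified |
| 9 | Qwen2-VL-72B | Average Accuracy | 52.7 | | Unverified |
| 10 | IXC-2.5 7B | Average Accuracy | 51.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LinVT-Qwen2-VL (7B) | Avg. | 69.3 | | Unverified |
| 2 | Tarsier (34B) | Avg. | 67.6 | | Unverified |
| 3 | InternVideo2 | Avg. | 67.2 | | Unverified |
| 4 | LongVU (7B) | Avg. | 66.9 | | Unverified |
| 5 | Oryx(34B) | Avg. | 64.7 | | Unverified |
| 6 | VideoLLaMA2 (72B) | Avg. | 62 | | Unverified |
| 7 | VideoChat-T (7B) | Avg. | 59.9 | | Unverified |
| 8 | mPLUG-Owl3(7B) | Avg. | 59.5 | | Unverified |
| 9 | PPLLaVA (7b) | Avg. | 59.2 | | Unverified |
| 10 | VideoGPT+ | Avg. | 58.7 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Mirasol3B | Accuracy | 50.42 | | Unverified |
| 2 | VAST | Accuracy | 50.1 | | Unverified |
| 3 | COSA | Accuracy | 49.2 | | Unverified |
| 4 | VALOR | Accuracy | 49.2 | | Unverified |
| 5 | MA-LMM | Accuracy | 48.5 | | Unverified |
| 6 | mPLUG-2 | Accuracy | 48 | | Unverified |
| 7 | FrozenBiLM | Accuracy | 47 | | Unverified |
| 8 | HBI | Accuracy | 46.2 | | Unverified |
| 9 | EMCL-Net | Accuracy | 45.8 | | Unverified |
| 10 | VindLU | Accuracy | 44.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VLAP (4 frames) | Average Accuracy | 67.1 | | Unverified |
| 2 | LLaMA-VQA | Average Accuracy | 65.4 | | Unverified |
| 3 | SeViLA | Average Accuracy | 64.9 | | Unverified |
| 4 | InternVideo | Average Accuracy | 58.7 | | Unverified |
| 5 | GF(sup) | Average Accuracy | 53.94 | | Unverified |
| 6 | GF(uns) | Average Accuracy | 53.86 | | Unverified |
| 7 | MIST | Average Accuracy | 51.13 | | Unverified |
| 8 | Temp[ATP] | Average Accuracy | 48.37 | | Unverified |
| 9 | AnyMAL-70B (0-shot) | Average Accuracy | 48.2 | | Unverified |
| 10 | All-in-one | Average Accuracy | 47.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Seed1.5-VL | AVG | 60 | | Unverified |
| 2 | VideoChat-Online (4B) | AVG | 54.9 | | Unverified |
| 3 | Gemini-1.5-Flash | AVG | 50.7 | | Unverified |
| 4 | Qwen2-VL (7B) | AVG | 49.7 | | Unverified |
| 5 | LLaVA-OneVision (7B) | AVG | 49.5 | | Unverified |
| 6 | InternVL2 (7B) | AVG | 48.7 | | Unverified |
| 7 | InternVL2 (4B) | AVG | 44.1 | | Unverified |
| 8 | LongVA (7B) | AVG | 43.6 | | Unverified |
| 9 | LLaMA-VID (7B) | AVG | 41.9 | | Unverified |
| 10 | MiniCPM-V 2.6 (7B) | AVG | 39.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GF (sup) - Faster RCNN | Average Accuracy | 55.08 | | Unverified |
| 2 | MIST - CLIP | Average Accuracy | 54.39 | | Unverified |
| 3 | GF (uns) - S3D | Average Accuracy | 53.33 | | Unverified |
| 4 | SViTT | Average Accuracy | 52.7 | | Unverified |
| 5 | MIST - AIO | Average Accuracy | 50.96 | | Unverified |
| 6 | SHG-VQA (trained from scratch) | Average Accuracy | 49.2 | | Unverified |
| 7 | AIO - ViT | Average Accuracy | 48.59 | | Unverified |
| 8 | MMTF | Average Accuracy | 44.36 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Text + Text (no Multimodal Pretext Training) | Accuracy | 93.2 | | Unverified |
| 2 | FrozenBiLM | Accuracy | 86.7 | | Unverified |
| 3 | Just Ask | Accuracy | 84.4 | | Unverified |
| 4 | SeViLA | Accuracy | 83.7 | | Unverified |
| 5 | Hero w/ pre-training | Accuracy | 77.75 | | Unverified |
| 6 | ATP | Accuracy | 65.1 | | Unverified |
| 7 | FrozenBiLM (0-shot) | Accuracy | 58.4 | | Unverified |
| 8 | Just Ask (0-shot) | Accuracy | 51.1 | | Unverified |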