SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 1951–2000 of 2177 papers

| Title | Status | Hype |
|---|---|---|
| Query and Attention Augmentation for Knowledge-Based Explainable Reasoning | Code | 0 |
| Dynamic Memory Networks for Visual and Textual Question Answering | Code | 0 |
| LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | Code | 0 |
| LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | Code | 0 |
| An Improved Attention for Visual Question Answering | Code | 0 |
| TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models | Code | 0 |
| TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models | Code | 0 |
| Adaptively Clustering Neighbor Elements for Image-Text Generation | Code | 0 |
| Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for Knowledge-based Visual Question Answering | Code | 0 |
| DVQA: Understanding Data Visualizations via Question Answering | Code | 0 |
| CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images | Code | 0 |
| Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge | Code | 0 |
| CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images | Code | 0 |
| CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning | Code | 0 |
| QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View | Code | 0 |
| Latent Alignment and Variational Attention | Code | 0 |
| Large Models in Dialogue for Active Perception and Anomaly Detection | Code | 0 |
| Large Language Models Understand Layout | Code | 0 |
| Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy | Code | 0 |
| DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue | Code | 0 |
| Dual Recurrent Attention Units for Visual Question Answering | Code | 0 |
| Diversify, Rationalize, and Combine: Ensembling Multiple QA Strategies for Zero-shot Knowledge-based VQA | Code | 0 |
| Dual Attention Networks for Visual Reference Resolution in Visual Dialog | Code | 0 |
| RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | Code | 0 |
| CAST: Cross-modal Alignment Similarity Test for Vision Language Models | Code | 0 |
| RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic | Code | 0 |
| Kvasir-VQA: A Text-Image Pair GI Tract Dataset | Code | 0 |
| A Neuro-Symbolic ASP Pipeline for Visual Question Answering | Code | 0 |
| KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language | Code | 0 |
| Knowledge Generation for Zero-shot Knowledge-based VQA | Code | 0 |
| Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge | Code | 0 |
| Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding | Code | 0 |
| Dual Attention Networks for Multimodal Reasoning and Matching | Code | 0 |
| Recommending Themes for Ad Creative Design via Visual-Linguistic Representations | Code | 0 |
| DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Code | 0 |
| Recursive Visual Attention in Visual Dialog | Code | 0 |
| Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models | Code | 0 |
| ReDiT: Re-evaluating large visual question answering model confidence by defining input scenario Difficulty and applying Temperature mapping | Code | 0 |
| Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving | Code | 0 |
| Cascaded Mutual Modulation for Visual Reasoning | Code | 0 |
| Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning | Code | 0 |
| Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning | Code | 0 |
| Towards a Unified Multimodal Reasoning Framework | Code | 0 |
| Relation-Aware Graph Attention Network for Visual Question Answering | Code | 0 |
| 'Just because you are right, doesn't mean I am wrong': Overcoming a Bottleneck in the Development and Evaluation of Open-Ended Visual Question Answering (VQA) Tasks | Code | 0 |
| Adaptive loose optimization for robust question answering | Code | 0 |
| REMIND Your Neural Network to Prevent Catastrophic Forgetting | Code | 0 |
| Bridging Vision and Language Spaces with Assignment Prediction | Code | 0 |
| Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models | Code | 0 |
| Joint Answering and Explanation for Visual Commonsense Reasoning | Code | 0 |
Page 40 of 44

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |