SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 601–650 of 2177 papers

| Title | Status | Hype |
|-------|--------|------|
| EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Code | 1 |
| Check It Again: Progressive Visual Question Answering via Visual Entailment | Code | 1 |
| ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification | Code | 1 |
| GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Code | 1 |
| Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Code | 1 |
| CaMML: Context-Aware Multimodal Learner for Large Models | Code | 1 |
| Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos | Code | 1 |
| A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge | Code | 1 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Code | 1 |
| Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images | Code | 1 |
| FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding | Code | 1 |
| MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting | Code | 1 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Code | 1 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | Code | 1 |
| Dynamic Language Binding in Relational Visual Reasoning | Code | 1 |
| Florence: A New Foundation Model for Computer Vision | Code | 1 |
| Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy | Code | 1 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Code | 1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1 |
| STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering | Code | 1 |
| Faithful Multimodal Explanation for Visual Question Answering | Code | 1 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Code | 1 |
| OmniNet: A unified architecture for multi-modal multi-task learning | Code | 0 |
| DVQA: Understanding Data Visualizations via Question Answering | Code | 0 |
| OmniFusion Technical Report | Code | 0 |
| DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue | Code | 0 |
| Dual Recurrent Attention Units for Visual Question Answering | Code | 0 |
| Bridging Vision and Language Spaces with Assignment Prediction | Code | 0 |
| Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering | Code | 0 |
| OG-SGG: Ontology-Guided Scene Graph Generation. A Case Study in Transfer Learning for Telepresence Robotics | Code | 0 |
| On Modality Bias Recognition and Reduction | Code | 0 |
| Dual Attention Networks for Visual Reference Resolution in Visual Dialog | Code | 0 |
| Dual Attention Networks for Multimodal Reasoning and Matching | Code | 0 |
| Object Attribute Matters in Visual Question Answering | Code | 0 |
| DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Code | 0 |
| Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering | Code | 0 |
| Towards Flexible Evaluation for Generative Visual Question Answering | Code | 0 |
| Answer Them All! Toward Universal Visual Question Answering Models | Code | 0 |
| Neural Module Networks | Code | 0 |
| Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding | Code | 0 |
| Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering | Code | 0 |
| Answer Questions with Right Image Regions: A Visual Attention Regularization Approach | Code | 0 |
| Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering | Code | 0 |
| NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization | Code | 0 |
| No Images, No Problem: Retaining Knowledge in Continual VQA with Questions-Only Memory | Code | 0 |
| MUTAN: Multimodal Tucker Fusion for Visual Question Answering | Code | 0 |
| Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study | Code | 0 |
| Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs | Code | 0 |
| Multi-Sourced Compositional Generalization in Visual Question Answering | Code | 0 |
Page 13 of 44

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |