SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 451–500 of 2,177 papers

Title | Status | Hype
MapQA: A Dataset for Question Answering on Choropleth Maps | Code | 1
Visual Named Entity Linking: A New Dataset and A Baseline | Code | 1
VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | Code | 1
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting | Code | 1
SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models | Code | 1
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Code | 1
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Code | 1
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA | Code | 1
Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning | Code | 1
Linearly Mapping from Image to Text Space | Code | 1
TVLT: Textless Vision-Language Transformer | Code | 1
Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline | Code | 1
MaXM: Towards Multilingual Visual Question Answering | Code | 1
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task | Code | 1
CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | Code | 1
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding | Code | 1
Generative Bias for Robust Visual Question Answering | Code | 1
LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | Code | 1
Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering | Code | 1
Rethinking Data Augmentation for Robust Visual Question Answering | Code | 1
ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities | Code | 1
Weakly Supervised Grounding for VQA in Vision-Language Transformers | Code | 1
A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA | Code | 1
Consistency-preserving Visual Question Answering in Medical Imaging | Code | 1
Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer | Code | 1
MixGen: A New Multi-Modal Data Augmentation | Code | 1
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | Code | 1
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge | Code | 1
REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | Code | 1
Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | Code | 1
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Code | 1
Learning to Answer Visual Questions from Web Videos | Code | 1
Declaration-based Prompt Tuning for Visual Question Answering | Code | 1
CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1
Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly | Code | 1
GRIT: General Robust Image Task Benchmark | Code | 1
Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Code | 1
Attention in Reasoning: Dataset, Analysis, and Modeling | Code | 1
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Code | 1
SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering | Code | 1
Learning to Answer Questions in Dynamic Audio-Visual Scenarios | Code | 1
A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration | Code | 1
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering | Code | 1
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Code | 1
Maintaining Reasoning Consistency in Compositional Visual Question Answering | Code | 1
LaTr: Layout-Aware Transformer for Scene-Text VQA | Code | 1
Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Code | 1
Distilled Dual-Encoder Model for Vision-Language Understanding | Code | 1
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering | Code | 1
Page 10 of 44

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | - | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | - | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | - | Unverified
4 | GPT-4o + text rationale + IoT | GPT-4 score | 72.2 | - | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | - | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | - | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | - | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | - | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | - | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | - | Unverified