SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 1301–1350 of 2177 papers

| Title | Status | Hype |
| --- | --- | --- |
| AlignVE: Visual Entailment Recognition Based on Alignment Relations | | 0 |
| PromptCap: Prompt-Guided Task-Aware Image Captioning | Code | 1 |
| MapQA: A Dataset for Question Answering on Choropleth Maps | Code | 1 |
| Visually Grounded VQA by Lattice-based Retrieval | Code | 0 |
| MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering | | 0 |
| Towards Reasoning-Aware Explainable VQA | | 0 |
| Visual Named Entity Linking: A New Dataset and A Baseline | Code | 1 |
| ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation | | 0 |
| What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility? | Code | 0 |
| Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems | | 0 |
| Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering | Code | 0 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | | 0 |
| VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | Code | 1 |
| RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | | 0 |
| PoseScript: Linking 3D Human Poses and Natural Language | Code | 2 |
| Image Semantic Relation Generation | | 0 |
| CPL: Counterfactual Prompt Learning for Vision and Language Models | | 0 |
| Aligning MAGMA by Few-Shot Learning and Finetuning | | 0 |
| Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering | | 0 |
| Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | Code | 0 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Code | 3 |
| MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting | Code | 1 |
| SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models | Code | 1 |
| ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Code | 1 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Code | 1 |
| Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing | | 0 |
| Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA | Code | 1 |
| Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning | Code | 1 |
| MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | | 0 |
| Retrieval Augmented Visual Question Answering with Outside Knowledge | Code | 2 |
| On the Effects of Video Grounding on Language Models | | 0 |
| Dual Capsule Attention Mask Network with Mutual Learning for Visual Question Answering | | 0 |
| A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering | Code | 0 |
| Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering | Code | 0 |
| Linearly Mapping from Image to Text Space | Code | 1 |
| TVLT: Textless Vision-Language Transformer | Code | 1 |
| RepsNet: Combining Vision with Language for Automated Medical Reports | | 0 |
| Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline | Code | 1 |
| Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos | Code | 0 |
| Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering | | 0 |
| Continual VQA for Disaster Response Systems | Code | 0 |
| Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances | Code | 0 |
| LAVIS: A Library for Language-Vision Intelligence | | 0 |
| Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering | | 0 |
| MUST-VQA: MUltilingual Scene-text VQA | | 0 |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | | 0 |
| PreSTU: Pre-Training for Scene-Text Understanding | | 0 |
| MaXM: Towards Multilingual Visual Question Answering | Code | 1 |
| Pre-training image-language transformers for open-vocabulary tasks | | 0 |
| Improving the Cross-Lingual Generalisation in Visual Question Answering | Code | 0 |
Page 27 of 44

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |