SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 251–300 of 2177 papers

| Title | Status | Hype |
|---|---|---|
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | Code | 1 |
| Localized Questions in Medical Visual Question Answering | Code | 1 |
| Learning Situation Hyper-Graphs for Video Question Answering | Code | 1 |
| Learning Trimodal Relation for AVQA with Missing Modality | Code | 1 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Code | 1 |
| BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs | Code | 1 |
| LaPA: Latent Prompt Assist Model For Medical Visual Question Answering | Code | 1 |
| EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Code | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1 |
| Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention | Code | 1 |
| Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA | Code | 1 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Code | 1 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Code | 1 |
| Advancing High Resolution Vision-Language Models in Biomedicine | Code | 1 |
| EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Code | 1 |
| Language Repository for Long Video Understanding | Code | 1 |
| Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | Code | 1 |
| Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering | Code | 1 |
| LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | Code | 1 |
| ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning | Code | 1 |
| Dynamic Language Binding in Relational Visual Reasoning | Code | 1 |
| Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts | Code | 1 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Code | 1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1 |
| Bayesian Attention Modules | Code | 1 |
| Label-Descriptive Patterns and Their Application to Characterizing Classification Errors | Code | 1 |
| Language-Informed Visual Concept Learning | Code | 1 |
| LaTr: Layout-Aware Transformer for Scene-Text VQA | Code | 1 |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | Code | 1 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Code | 1 |
| JDocQA: Japanese Document Question Answering Dataset for Generative Language Models | Code | 1 |
| Dual-Key Multimodal Backdoors for Visual Question Answering | Code | 1 |
| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Code | 1 |
| Visual Grounding Methods for VQA are Working for the Wrong Reasons! | Code | 1 |
| BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation | Code | 1 |
| A Dataset and Baselines for Visual Question Answering on Art | Code | 1 |
| Don't Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases | Code | 1 |
| An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge | Code | 1 |
| Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning | Code | 1 |
| Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Code | 1 |
| InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | Code | 1 |
| Does Vision-and-Language Pretraining Improve Lexical Grounding? | Code | 1 |
| Instruction-Guided Visual Masking | Code | 1 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Code | 1 |
| Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding | Code | 1 |
| LIVE: Learnable In-Context Vector for Visual Question Answering | Code | 1 |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Code | 1 |
| Disentangling 3D Prototypical Networks For Few-Shot Concept Learning | Code | 1 |
| In Defense of Grid Features for Visual Question Answering | Code | 1 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1 |
Page 6 of 44

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o + text rationale + IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |