SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 851–900 of 2177 papers

Title | Status | Hype
Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering | - | 0
Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models | - | 0
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning | - | 0
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning | Code | 2
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models | Code | 3
CoLLaVO: Crayon Large Language and Vision mOdel | Code | 2
II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering | Code | 0
VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models | - | 0
Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models | Code | 1
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter | - | 0
Prompt-based Personalized Federated Learning for Medical Visual Question Answering | - | 0
Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays | Code | 0
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM | Code | 4
Learning How To Ask: Cycle-Consistency Refines Prompts in Multimodal Foundation Models | - | 0
PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers | Code | 3
Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks | - | 0
Visually Dehallucinative Instruction Generation | Code | 0
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | Code | 4
Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data | Code | 0
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs | - | 0
Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs | Code | 3
Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy | Code | 1
Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations | Code | 1
Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey | Code | 3
CIC: A Framework for Culturally-Aware Image Captioning | - | 0
Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images | Code | 0
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Code | 7
Convincing Rationales for Visual Question Answering Reasoning | Code | 0
Text-Guided Image Clustering | Code | 1
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | Code | 4
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Code | 2
Knowledge Generation for Zero-shot Knowledge-based VQA | Code | 0
Instruction Makes a Difference | Code | 0
Can Generative AI Support Patients' & Caregivers' Informational Needs? Towards Task-Centric Evaluation Of AI Systems | - | 0
From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information | - | 0
MouSi: Poly-Visual-Expert Vision-Language Models | Code | 2
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | - | 0
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | - | 0
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | Code | 7
LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering | - | 0
Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning | - | 0
Free Form Medical Visual Question Answering in Radiology | - | 0
Small Language Model Meets with Reinforced Vision Vocabulary | - | 0
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | - | 0
Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge | Code | 1
Veagle: Advancements in Multimodal Representation Learning | Code | 1
Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation | Code | 1
COCO is "ALL" You Need for Visual Instruction Fine-tuning | - | 0
Uncovering the Full Potential of Visual Grounding Methods in VQA | Code | 0
BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining | - | 0
Page 18 of 44

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | - | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | - | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | - | Unverified
4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | - | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | - | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | - | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | - | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | - | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | - | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | - | Unverified
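
The "GPT-4 score" metric above is a model-as-judge rating: GPT-4 grades each candidate answer against a reference and the per-example ratings are averaged. The sketch below is a minimal illustration, assuming the common LLaVA-Bench-style protocol (a 1-10 rating scaled to 0-100); the prompt wording, scale, and the helper names judge_one and gpt4_score are illustrative assumptions, not this leaderboard's exact protocol.

    # Minimal sketch of a GPT-4-as-judge "GPT-4 score" for VQA answers.
    # ASSUMPTION: LLaVA-Bench-style grading (1-10 rating, averaged and
    # scaled to 0-100); the prompt text is illustrative only.
    import re
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    JUDGE_PROMPT = """You are grading a visual question answering system.
    Question: {question}
    Reference answer: {reference}
    Candidate answer: {candidate}
    Rate the candidate answer's correctness and helpfulness against the
    reference on a scale of 1 to 10. Reply with the number only."""

    def judge_one(question: str, reference: str, candidate: str) -> float:
        resp = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # deterministic grading
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate)}],
        )
        # Pull the first number out of the reply; fall back to 0 if absent.
        match = re.search(r"\d+(\.\d+)?", resp.choices[0].message.content)
        return float(match.group()) if match else 0.0

    def gpt4_score(examples: list[dict]) -> float:
        """Average the 1-10 judge ratings over a dataset, scaled to 0-100."""
        ratings = [judge_one(ex["question"], ex["reference"], ex["candidate"])
                   for ex in examples]
        return 10.0 * sum(ratings) / len(ratings)

Because the rating depends on the judge model, its version, and the prompt, a claimed score and an independently verified score can diverge even on identical predictions, which is what the Verified column is meant to capture.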