SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 251–275 of 2177 papers

| Title | Status | Hype |
| --- | --- | --- |
| MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression | Code | 1 |
| PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? | Code | 1 |
| Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models | Code | 1 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Code | 1 |
| Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering | Code | 1 |
| Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization | Code | 1 |
| MedCoT: Medical Chain of Thought via Hierarchical Expert | Code | 1 |
| MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants | Code | 1 |
| IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Code | 1 |
| ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | Code | 1 |
| RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Code | 1 |
| MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | Code | 1 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Code | 1 |
| A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | Code | 1 |
| Cross-modal Information Flow in Multimodal Large Language Models | Code | 1 |
| Teaching VLMs to Localize Specific Objects from In-context Examples | Code | 1 |
| A Survey of Medical Vision-and-Language Applications and Their Techniques | Code | 1 |
| BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation | Code | 1 |
| Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | Code | 1 |
| Nearest Neighbor Normalization Improves Multimodal Retrieval | Code | 1 |
| ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning | Code | 1 |
| Progressive Compositionality In Text-to-Image Generative Models | Code | 1 |
| MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Code | 1 |
| VividMed: Vision Language Model with Versatile Visual Grounding for Medicine | Code | 1 |
Page 11 of 88

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |