SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 251–300 of 2177 papers

| Title | Status | Hype |
|---|---|---|
| MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression | Code | 1 |
| PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? | Code | 1 |
| Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models | Code | 1 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Code | 1 |
| Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering | Code | 1 |
| Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization | Code | 1 |
| MedCoT: Medical Chain of Thought via Hierarchical Expert | Code | 1 |
| MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants | Code | 1 |
| IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Code | 1 |
| ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | Code | 1 |
| RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Code | 1 |
| MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | Code | 1 |
| A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | Code | 1 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Code | 1 |
| Cross-modal Information Flow in Multimodal Large Language Models | Code | 1 |
| Teaching VLMs to Localize Specific Objects from In-context Examples | Code | 1 |
| A Survey of Medical Vision-and-Language Applications and Their Techniques | Code | 1 |
| BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation | Code | 1 |
| Nearest Neighbor Normalization Improves Multimodal Retrieval | Code | 1 |
| Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | Code | 1 |
| ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning | Code | 1 |
| Progressive Compositionality In Text-to-Image Generative Models | Code | 1 |
| MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Code | 1 |
| WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines | Code | 1 |
| VividMed: Vision Language Model with Versatile Visual Grounding for Medicine | Code | 1 |
| Towards Foundation Models for 3D Vision: How Close Are We? | Code | 1 |
| Skipping Computations in Multimodal LLMs | Code | 1 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Code | 1 |
| ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models | Code | 1 |
| MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration | Code | 1 |
| A Hitchhiker's Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Code | 1 |
| T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition | Code | 1 |
| Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE | Code | 1 |
| MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models | Code | 1 |
| Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering | Code | 1 |
| Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs | Code | 1 |
| LIME: Less Is More for MLLM Evaluation | Code | 1 |
| M3-Jepa: Multimodal Alignment via Multi-directional MoE based on the JEPA framework | Code | 1 |
| V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard? | Code | 1 |
| Visual Agents as Fast and Slow Thinkers | Code | 1 |
| Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery | Code | 1 |
| Boosting Audio Visual Question Answering via Key Semantic-Aware Cues | Code | 1 |
| INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | Code | 1 |
| Learning Trimodal Relation for AVQA with Missing Modality | Code | 1 |
| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Code | 1 |
| Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark | Code | 1 |
| CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Code | 1 |
| MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment | Code | 1 |
| STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering | Code | 1 |
Page 6 of 44

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | – | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | – | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | – | Unverified |
| 4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | – | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | – | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | – | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | – | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | – | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | – | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | – | Unverified |