SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 201–225 of 2177 papers

| Title | Status | Hype |
| --- | --- | --- |
| JourneyDB: A Benchmark for Generative Image Understanding | Code | 2 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Code | 2 |
| Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Code | 2 |
| Large Continual Instruction Assistant | Code | 2 |
| Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Code | 2 |
| BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks | Code | 2 |
| BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities | Code | 2 |
| Beyond Text: Frozen Large Language Models in Visual Signal Comprehension | Code | 2 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | Code | 2 |
| ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | Code | 2 |
| Doe-1: Closed-Loop Autonomous Driving with Large World Model | Code | 2 |
| DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding | Code | 2 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | Code | 2 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Code | 2 |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | Code | 2 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | Code | 2 |
| Towards A Generalizable Pathology Foundation Model via Unified Knowledge Distillation | Code | 2 |
| GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Code | 2 |
| Phantom of Latent for Large Language and Vision Models | Code | 2 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Code | 1 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Code | 1 |
| AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors | Code | 1 |
| A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering | Code | 1 |
| DeVLBert: Learning Deconfounded Visio-Linguistic Representations | Code | 1 |
| How to Configure Good In-Context Sequence for Visual Question Answering | Code | 1 |
Page 9 of 88

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |