SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 176–200 of 2177 papers

Title | Status | Hype
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | Code | 2
LLMGA: Multimodal Large Language Model based Generation Assistant | Code | 2
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Code | 2
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Code | 2
A Simple Aerial Detection Baseline of Multimodal Language Models | Code | 2
Large Continual Instruction Assistant | Code | 2
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering | Code | 2
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Code | 2
LingoQA: Visual Question Answering for Autonomous Driving | Code | 2
BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks | Code | 2
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models | Code | 2
LinVT: Empower Your Image-level Large Language Model to Understand Videos | Code | 2
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | Code | 2
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension | Code | 2
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities | Code | 2
JourneyDB: A Benchmark for Generative Image Understanding | Code | 2
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | Code | 2
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI | Code | 2
EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis | Code | 2
Imp: Highly Capable Large Multimodal Models for Mobile Devices | Code | 2
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Code | 2
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Code | 2
Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Code | 2
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | Code | 2
Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment | Code | 2
Page 8 of 88

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | — | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | — | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | — | Unverified
4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | — | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | — | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | — | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | — | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | — | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | — | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | — | Unverified