SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 151–200 of 2177 papers

| Title | Status | Hype |
| --- | --- | --- |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | Code | 2 |
| Dual Diffusion for Unified Image Generation and Understanding | Code | 2 |
| MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis | Code | 2 |
| Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis | Code | 2 |
| GSCo: Towards Generalizable AI in Medicine via Generalist-Specialist Collaboration | Code | 2 |
| MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Code | 2 |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | Code | 2 |
| Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering | Code | 2 |
| LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Code | 2 |
| Aligning Modalities in Vision Large Language Models via Preference Fine-tuning | Code | 2 |
| Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Code | 2 |
| Doe-1: Closed-Loop Autonomous Driving with Large World Model | Code | 2 |
| LOVA3: Learning to Visual Question Answering, Asking and Assessment | Code | 2 |
| MedM-VL: What Makes a Good Medical LVLM? | Code | 2 |
| LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | Code | 2 |
| MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning | Code | 2 |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Code | 2 |
| Efficient Large Multi-modal Models via Visual Context Compression | Code | 2 |
| CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Code | 2 |
| Calibrated Self-Rewarding Vision Language Models | Code | 2 |
| Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification | Code | 2 |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | Code | 2 |
| LLMGA: Multimodal Large Language Model based Generation Assistant | Code | 2 |
| MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding | Code | 2 |
| MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | Code | 2 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Code | 2 |
| Large Continual Instruction Assistant | Code | 2 |
| A Simple Aerial Detection Baseline of Multimodal Language Models | Code | 2 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Code | 2 |
| MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | Code | 2 |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | Code | 2 |
| Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering | Code | 2 |
| BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks | Code | 2 |
| LingoQA: Visual Question Answering for Autonomous Driving | Code | 2 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Code | 2 |
| OneLLM: One Framework to Align All Modalities with Language | Code | 2 |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | Code | 2 |
| MC-LLaVA: Multi-Concept Personalized Vision-Language Model | Code | 2 |
| Beyond Text: Frozen Large Language Models in Visual Signal Comprehension | Code | 2 |
| BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities | Code | 2 |
| JourneyDB: A Benchmark for Generative Image Understanding | Code | 2 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Code | 2 |
| ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | Code | 2 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | Code | 2 |
| Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Code | 2 |
| Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Code | 2 |
| GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI | Code | 2 |
| Phantom of Latent for Large Language and Vision Models | Code | 2 |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | Code | 2 |
| Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment | Code | 2 |
Page 4 of 44

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o + text rationale + IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |