SOTAVerified

Visual Question Answering

Papers

Showing 601-625 of 2177 papers

Title | Status | Hype
Hierarchical multimodal transformers for Multi-Page DocVQA | Code | 1
CaMML: Context-Aware Multimodal Learner for Large Models | Code | 1
Hierarchical Question-Image Co-Attention for Visual Question Answering | Code | 1
How to Configure Good In-Context Sequence for Visual Question Answering | Code | 1
Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | Code | 1
MixGen: A New Multi-Modal Data Augmentation | Code | 1
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge | Code | 1
Florence: A New Foundation Model for Computer Vision | Code | 1
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Code | 1
HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles | Code | 1
Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images | Code | 1
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | Code | 1
Dynamic Language Binding in Relational Visual Reasoning | Code | 1
Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | Code | 1
Foundation Model is Efficient Multimodal Multitask Model Selector | Code | 1
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Code | 1
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Code | 1
Label-Descriptive Patterns and Their Application to Characterizing Classification Errors | Code | 1
MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding | Code | 1
Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification | Code | 1
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering | Code | 1
Faithful Multimodal Explanation for Visual Question Answering | Code | 1
Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Code | 1
Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering | | 0
Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified
4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified