SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 501–525 of 2177 papers

Title | Status | Hype
MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering | Code | 1
Change Detection Meets Visual Question Answering | Code | 1
Foundation Model is Efficient Multimodal Multitask Model Selector | Code | 1
Making Large Language Models Better Data Creators | Code | 1
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge | Code | 1
Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules | Code | 1
MISS: A Generative Pretraining and Finetuning Approach for Med-VQA | Code | 1
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | Code | 1
Many Heads but One Brain: Fusion Brain -- a Competition and a Single Multimodal Multitask Architecture | Code | 1
MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding | Code | 1
AI2-THOR: An Interactive 3D Environment for Visual AI | Code | 1
Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | Code | 1
Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features | Code | 1
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving | Code | 1
Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering | Code | 1
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | Code | 1
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes | Code | 1
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Code | 1
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Code | 1
MemeCap: A Dataset for Captioning and Interpreting Memes | Code | 1
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Code | 1
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Code | 1
Can We Talk Models Into Seeing the World Differently? | Code | 1
Faithful Multimodal Explanation for Visual Question Answering | Code | 1
Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | Code | 1
Page 21 of 88

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified
4 | GPT-4o + text rationale + IoT | GPT-4 score | 72.2 | | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified