SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 376–400 of 2177 papers

| Title | Status | Hype |
|---|---|---|
| Evaluating Multimodal Representations on Visual Semantic Textual Similarity | Code | 1 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1 |
| CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes | Code | 1 |
| ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Code | 1 |
| Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving | Code | 1 |
| Localized Questions in Medical Visual Question Answering | Code | 1 |
| MapQA: A Dataset for Question Answering on Choropleth Maps | Code | 1 |
| Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | Code | 1 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Code | 1 |
| CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | Code | 1 |
| LIME: Less Is More for MLLM Evaluation | Code | 1 |
| Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator | Code | 1 |
| Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs | Code | 1 |
| Linearly Mapping from Image to Text Space | Code | 1 |
| CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | Code | 1 |
| CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | Code | 1 |
| An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge | Code | 1 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Code | 1 |
| MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering | Code | 1 |
| EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Code | 1 |
| Cross-modal Information Flow in Multimodal Large Language Models | Code | 1 |
| Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering | Code | 1 |
| BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation | Code | 1 |
| Learning to Answer Visual Questions from Web Videos | Code | 1 |
| Learning to Discretely Compose Reasoning Module Networks for Video Captioning | Code | 1 |
Page 16 of 88

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |