SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 1151–1175 of 2177 papers

Title | Status | Hype
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models | Code | 1
Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA | - | 0
Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge | - | 0
Multi-Scale Attention for Audio Question Answering | Code | 1
HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language | Code | 0
Modularized Zero-shot VQA with Pre-trained Models | Code | 0
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | Code | 1
Zero-shot Visual Question Answering with Language Model Feedback | Code | 0
Mindstorms in Natural Language-Based Societies of Mind | - | 0
BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks | Code | 2
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | - | 0
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario | Code | 2
GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions | - | 0
Measuring Faithful and Plausible Visual Grounding in VQA | Code | 0
Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models | Code | 1
Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering | - | 0
The Art of SOCRATIC QUESTIONING: Recursive Thinking with Large Language Models | Code | 1
Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach | - | 0
MemeCap: A Dataset for Captioning and Interpreting Memes | Code | 1
i-Code Studio: A Configurable and Composable Framework for Integrative AI | - | 0
DUBLIN -- Document Understanding By Language-Image Network | - | 0
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios | Code | 0
VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models | Code | 1
What Makes for Good Visual Tokenizers for Large Language Models? | Code | 1
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | - | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | - | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | - | Unverified
4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | - | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | - | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | - | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | - | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | - | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | - | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | - | Unverified