SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 651–700 of 2177 papers (page 14 of 44)

Title | Status | Hype
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | — | 0
HAIBU-ReMUD: Reasoning Multimodal Ultrasound Dataset and Model Bridging to General Specific Domains | Code | 0
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning | — | 0
Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning | — | 0
Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering | — | 0
Ontology-based knowledge representation for bone disease diagnosis: a foundation for safe and sustainable medical artificial intelligence systems | — | 0
TextVidBench: A Benchmark for Long Video Scene Text Understanding | — | 0
ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding | — | 0
Learning Sparsity for Effective and Efficient Music Performance Question Answering | — | 0
Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation | — | 0
Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering | — | 0
Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck | — | 0
Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models | — | 0
MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility | — | 0
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation | — | 0
Synthetic Document Question Answering in Hungarian | Code | 0
QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining | Code | 0
Multi-Sourced Compositional Generalization in Visual Question Answering | Code | 0
NegVQA: Can Vision Language Models Understand Negation? | — | 0
Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs | Code | 0
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | Code | 0
Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat | — | 0
MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering | Code | 0
GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance | — | 0
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays | Code | 0
CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering | — | 0
Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation | — | 0
A Causal Approach to Mitigate Modality Preference Bias in Medical Visual Question Answering | — | 0
Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge | — | 0
Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports | — | 0
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding | — | 0
Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | — | 0
Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning | — | 0
Visual Question Answering on Multiple Remote Sensing Image Modalities | — | 0
SNAP: A Benchmark for Testing the Effects of Capture Conditions on Fundamental Vision Tasks | Code | 0
Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | Code | 0
TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving | — | 0
TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models | Code | 0
Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification | — | 0
Domain Adaptation of VLM for Soccer Video Understanding | — | 0
Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models | — | 0
Debating for Better Reasoning: An Unsupervised Multimodal Approach | — | 0
Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method | — | 0
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | Code | 0
Understanding Complexity in VideoQA via Visual Program Generation | — | 0
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | Code | 0
TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs | Code | 0
End-to-End Vision Tokenizer Tuning | — | 0
Variational Visual Question Answering | — | 0
Visually Interpretable Subtask Reasoning for Visual Question Answering | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | — | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | — | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | — | Unverified
4 | GPT-4o + text rationale + IoT | GPT-4 score | 72.2 | — | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | — | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | — | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | — | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | — | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | — | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | — | Unverified