SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 751–800 of 2177 papers

| Title | Status | Hype |
| --- | --- | --- |
| Curriculum Learning Effectively Improves Low Data VQA | | 0 |
| An Empirical Study of Batch Normalization and Group Normalization in Conditional Computation | | 0 |
| Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering | | 0 |
| CTRL-O: Language-Controllable Object-Centric Visual Representation Learning | | 0 |
| Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses | | 0 |
| CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering | | 0 |
| CS-VQA: Visual Question Answering with Compressively Sensed Images | | 0 |
| Balancing Performance and Efficiency in Zero-shot Robotic Navigation | | 0 |
| CrossVQA: Scalably Generating Benchmarks for Systematically Testing VQA Generalization | | 0 |
| Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models | | 0 |
| How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | | 0 |
| Cross-Modal Retrieval Augmentation for Multi-Modal Classification | | 0 |
| BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs | | 0 |
| An Empirical Evaluation of Visual Question Answering for Novel Objects | | 0 |
| JEEM: Vision-Language Understanding in Four Arabic Dialects | | 0 |
| How to find a good image-text embedding for remote sensing visual question answering? | | 0 |
| Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering | | 0 |
| iVQA: Inverse Visual Question Answering | | 0 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | | 0 |
| Cross-Modal Generative Augmentation for Visual Question Answering | | 0 |
| Backdooring Vision-Language Models with Out-Of-Distribution Data | | 0 |
| Jaeger: A Concatenation-Based Multi-Transformer VQA Model | | 0 |
| Anatomy Might Be All You Need: Forecasting What to Do During Surgery | | 0 |
| HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision | | 0 |
| Iterated learning for emergent systematicity in VQA | | 0 |
| ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention | | 0 |
| Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion | | 0 |
| Is GPT-3 all you need for Visual Question Answering in Cultural Heritage? | | 0 |
| It Takes Two to Tango: Towards Theory of AI's Mind | | 0 |
| Hierarchical Graph Attention Network for Few-Shot Visual-Semantic Learning | | 0 |
| Inverse Visual Question Answering with Multi-Level Attentions | | 0 |
| A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning | | 0 |
| Crossformer: Transformer with Alternated Cross-Layer Guidance | | 0 |
| Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool | | 0 |
| Investigating Biases in Textual Entailment Datasets | | 0 |
| Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation | | 0 |
| HAMMR: HierArchical MultiModal React agents for generic VQA | | 0 |
| Cross-Dataset Adaptation for Visual Question Answering | | 0 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | | 0 |
| How good are deep models in understanding the generated images? | | 0 |
| A Vision Centric Remote Sensing Benchmark | | 0 |
| An Analysis of Visual Question Answering Algorithms | | 0 |
| A dataset of clinically generated visual questions and answers about radiology images | | 0 |
| CROME: Cross-Modal Adapters for Efficient Multimodal LLM | | 0 |
| Interpretable Visual Question Answering via Reasoning Supervision | | 0 |
| How to Design Sample and Computationally Efficient VQA Models | | 0 |
| Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning | | 0 |
| How Transferable are Reasoning Patterns in VQA? | | 0 |
| CREPE: Coordinate-Aware End-to-End Document Parser | | 0 |
| Interpretable Visual Reasoning via Probabilistic Formulation under Natural Supervision | | 0 |
Page 16 of 44

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |