SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 651-700 of 2177 papers

| Title | Status | Hype |
| --- | --- | --- |
| Improved Alignment of Modalities in Large Vision Language Models | | 0 |
| Domain Adaptation of VLM for Soccer Video Understanding | | 0 |
| Do Explanations make VQA Models more Predictable to a Human? | | 0 |
| Adversarial Regularization for Visual Question Answering: Strengths, Shortcomings, and Side Effects | | 0 |
| Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | | 0 |
| Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? | | 0 |
| Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! | | 0 |
| Boosting Cross-task Transferability of Adversarial Patches with Visual Relations | | 0 |
| Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering | | 0 |
| BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining | | 0 |
| Document Visual Question Answering Challenge 2020 | | 0 |
| Document Collection Visual Question Answering | | 0 |
| Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models | | 0 |
| Improved Bilinear Pooling with CNNs | | 0 |
| Improving Users' Mental Model with Attention-directed Counterfactual Edits | | 0 |
| Document AI: Benchmarks, Models and Applications | | 0 |
| A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models | | 0 |
| Image Semantic Relation Generation | | 0 |
| DLIP: Distilling Language-Image Pre-training | | 0 |
| Generating Question Relevant Captions to Aid Visual Question Answering | | 0 |
| ImageTTR: Grounding Type Theory with Records in Image Classification for Visual Question Answering | | 0 |
| Diversity and Consistency: Exploring Visual Question-Answer Pair Generation | | 0 |
| Diversifying Joint Vision-Language Tokenization Learning | | 0 |
| DistilDoc: Knowledge Distillation for Visually-Rich Document Applications | | 0 |
| Adversarial Attacks Beyond the Image Space | | 0 |
| Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models | | 0 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | | 0 |
| Adventurer's Treasure Hunt: A Transparent System for Visually Grounded Compositional Visual Question Answering based on Scene Graphs | | 0 |
| Image Captioning with Compositional Neural Module Networks | | 0 |
| Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question Answering | | 0 |
| Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach | | 0 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | | 0 |
| Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning | | 0 |
| Directional Gradient Projection for Robust Fine-Tuning of Foundation Models | | 0 |
| DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels | | 0 |
| A Novel Framework for Robustness Analysis of Visual QA Models | | 0 |
| Image Captioning and Visual Question Answering Based on Attributes and External Knowledge | | 0 |
| Image Position Prediction in Multimodal Documents | | 0 |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | | 0 |
| Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions | | 0 |
| Differentiable End-to-End Program Executor for Sample and Computationally Efficient VQA | | 0 |
| A Novel Attention-based Aggregation Function to Combine Vision and Language | | 0 |
| Beyond the Hype: A dispassionate look at vision-language models in medical scenario | | 0 |
| Advancing Surgical VQA with Scene Graph Knowledge | | 0 |
| Detection-based Intermediate Supervision for Visual Question Answering | | 0 |
| Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos | | 0 |
| An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | | 0 |
| CLIPPO: Image-and-Language Understanding from Pixels Only | | 0 |
| Detecting and Evaluating Medical Hallucinations in Large Vision Language Models | | 0 |
| Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation | | 0 |
Page 14 of 44

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |