SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 251–300 of 2177 papers

Title | Status | Hype
Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Code | 1
Language Repository for Long Video Understanding | Code | 1
Beyond Question-Based Biases: Assessing Multimodal Shortcut Learning in Visual Question Answering | Code | 1
Localized Questions in Medical Visual Question Answering | Code | 1
Are Bias Mitigation Techniques for Deep Learning Effective? | Code | 1
Counterfactual Samples Synthesizing for Robust Visual Question Answering | Code | 1
LaPA: Latent Prompt Assist Model For Medical Visual Question Answering | Code | 1
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning | Code | 1
Linearly Mapping from Image to Text Space | Code | 1
Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Code | 1
Label-Descriptive Patterns and Their Application to Characterizing Classification Errors | Code | 1
Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding | Code | 1
LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | Code | 1
Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Code | 1
Combo of Thinking and Observing for Outside-Knowledge VQA | Code | 1
Advancing High Resolution Vision-Language Models in Biomedicine | Code | 1
JDocQA: Japanese Document Question Answering Dataset for Generative Language Models | Code | 1
Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts | Code | 1
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Code | 1
Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering | Code | 1
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning | Code | 1
Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning | Code | 1
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Code | 1
Instruction-Guided Visual Masking | Code | 1
Bayesian Attention Modules | Code | 1
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Code | 1
Language-Informed Visual Concept Learning | Code | 1
InfMLLM: A Unified Framework for Visual-Language Tasks | Code | 1
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | Code | 1
In Defense of Grid Features for Visual Question Answering | Code | 1
Improving Selective Visual Question Answering by Learning from Your Peers | Code | 1
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | Code | 1
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Code | 1
IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Code | 1
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Code | 1
Visual Grounding Methods for VQA are Working for the Wrong Reasons! | Code | 1
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Code | 1
BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation | Code | 1
A Dataset and Baselines for Visual Question Answering on Art | Code | 1
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | Code | 1
CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1
An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge | Code | 1
Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models | Code | 1
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Code | 1
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Code | 1
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment | Code | 1
I2I: Initializing Adapters with Improvised Knowledge | Code | 1
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Code | 1
InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | Code | 1
Page 6 of 44

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified
4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified