SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 601–650 of 2177 papers

Title | Status | Hype
CaMML: Context-Aware Multimodal Learner for Large Models | Code | 1
Check It Again: Progressive Visual Question Answering via Visual Entailment | Code | 1
Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models | Code | 1
GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Code | 1
InfMLLM: A Unified Framework for Visual-Language Tasks | Code | 1
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | Code | 1
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge | Code | 1
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Code | 1
Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Code | 1
Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images | Code | 1
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Code | 1
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering | Code | 1
Dynamic Language Binding in Relational Visual Reasoning | Code | 1
Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | Code | 1
FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding | Code | 1
I2I: Initializing Adapters with Improvised Knowledge | Code | 1
Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images | Code | 1
Multimodal Federated Learning via Contrastive Representation Ensemble | Code | 1
LaPA: Latent Prompt Assist Model For Medical Visual Question Answering | Code | 1
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications | Code | 1
Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark | Code | 1
Faithful Multimodal Explanation for Visual Question Answering | Code | 1
Skipping Computations in Multimodal LLMs | Code | 1
Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering | | 0
Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering | | 0
Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks | | 0
DUBLIN -- Document Understanding By Language-Image Network | | 0
BuDDIE: A Business Document Dataset for Multi-task Information Extraction | | 0
How Much Can CLIP Benefit Vision-and-Language Tasks? | | 0
Adversarial Representation Learning for Text-to-Image Matching | | 0
AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making | | 0
Ontology-based knowledge representation for bone disease diagnosis: a foundation for safe and sustainable medical artificial intelligence systems | | 0
DualNet: Domain-Invariant Network for Visual Question Answering | | 0
Bridging the Semantic Gaps: Improving Medical VQA Consistency with LLM-Augmented Question Sets | | 0
Dual Capsule Attention Mask Network with Mutual Learning for Visual Question Answering | | 0
Bridge Damage Cause Estimation Using Multiple Images Based on Visual Question Answering | | 0
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | | 0
Breaking Neural Network Scaling Laws with Modularity | | 0
DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback | | 0
Breaking Down Questions for Outside-Knowledge Visual Question Answering | | 0
Answer-Type Prediction for Visual Question Answering | | 0
How good are deep models in understanding the generated images? | | 0
How to Design Sample and Computationally Efficient VQA Models | | 0
Breaking Down Questions for Outside-Knowledge VQA | | 0
Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness | | 0
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | | 0
Adversarial Multimodal Network for Movie Question Answering | | 0
Domain-robust VQA with diverse datasets and methods but no target labels | | 0
Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion | | 0
Domain Adaptation of VLM for Soccer Video Understanding | | 0
Page 13 of 44

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified
4 | GPT-4o + text rationale + IoT | GPT-4 score | 72.2 | | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified