SOTAVerified

Visual Question Answering

Papers

Showing 851–900 of 2177 papers

| Title | Status | Hype |
| --- | --- | --- |
| Neural Reasoning, Fast and Slow, for Video Question Answering | | 0 |
| Improving Automatic VQA Evaluation Using Large Language Models | | 0 |
| Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning | | 0 |
| Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning | | 0 |
| Hadamard product in deep learning: Introduction, Advances and Challenges | | 0 |
| AVIS: Autonomous Visual Information Seeking with Large Language Model Agent | | 0 |
| CQ-VQA: Visual Question Answering on Categorized Questions | | 0 |
| Learning to Disambiguate by Asking Discriminative Questions | | 0 |
| Learning to Recognize the Unseen Visual Predicates | | 0 |
| Improving Visual Question Answering by Referring to Generated Paragraph Captions | | 0 |
| Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision | | 0 |
| Improving VQA and its Explanations by Comparing Competing Explanations | | 0 |
| Leveraging Visual Question Answering to Improve Text-to-Image Synthesis | | 0 |
| Look, Learn and Leverage (L^3): Mitigating Visual-Domain Shift and Discovering Intrinsic Relations via Symbolic Alignment | | 0 |
| H2OVL-Mississippi Vision Language Models Technical Report | | 0 |
| CPL: Counterfactual Prompt Learning for Vision and Language Models | | 0 |
| In Factuality: Efficient Integration of Relevant Facts for Visual Question Answering | | 0 |
| InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding | | 0 |
| Guiding Visual Question Answering with Attention Priors | | 0 |
| CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology | | 0 |
| Auto-Parsing Network for Image Captioning and Visual Question Answering | | 0 |
| Learning Sparse Mixture of Experts for Visual Question Answering | | 0 |
| Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning | | 0 |
| Instance-Level Trojan Attacks on Visual Question Answering via Adversarial Learning in Neuron Activation Space | | 0 |
| Grounding Task Assistance with Multimodal Cues from a Single Demonstration | | 0 |
| Co-VQA : Answering by Interactive Sub Question Sequence | | 0 |
| Instruction-augmented Multimodal Alignment for Image-Text and Element Matching | | 0 |
| Grounding Complex Navigational Instructions Using Scene Graphs | | 0 |
| Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports | | 0 |
| Co-VQA : Answering by Interactive Sub Question Sequence | | 0 |
| Learning Sparsity for Effective and Efficient Music Performance Question Answering | | 0 |
| Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models | | 0 |
| Grounding Answers for Visual Questions Asked by Visually Impaired People | | 0 |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | | 0 |
| Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent | | 0 |
| Interactive Attention AI to translate low light photos to captions for night scene understanding in women safety | | 0 |
| Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs | | 0 |
| Grounded Word Sense Translation | | 0 |
| Grounded Knowledge-Enhanced Medical VLP for Chest X-Ray | | 0 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | | 0 |
| Learning Rich Image Region Representation for Visual Question Answering | | 0 |
| GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions | | 0 |
| Counterfactual Vision and Language Learning | | 0 |
| Interpretable Counting for Visual Question Answering | | 0 |
| Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | | 0 |
| Analysis on Image Set Visual Question Answering | | 0 |
| GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback | | 0 |
| Graph-Structured Representations for Visual Question Answering | | 0 |
| Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture | | 0 |
| Bilinear Graph Networks for Visual Question Answering | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |