SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 1151–1200 of 2177 papers

| Title | Status | Hype |
|---|---|---|
| Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models | Code | 1 |
| Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA | — | 0 |
| Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge | — | 0 |
| Multi-Scale Attention for Audio Question Answering | Code | 1 |
| HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language | Code | 0 |
| Modularized Zero-shot VQA with Pre-trained Models | Code | 0 |
| CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | Code | 1 |
| Zero-shot Visual Question Answering with Language Model Feedback | Code | 0 |
| Mindstorms in Natural Language-Based Societies of Mind | — | 0 |
| BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks | Code | 2 |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | — | 0 |
| NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario | Code | 2 |
| GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions | — | 0 |
| Measuring Faithful and Plausible Visual Grounding in VQA | Code | 0 |
| Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models | Code | 1 |
| Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering | — | 0 |
| The Art of SOCRATIC QUESTIONING: Recursive Thinking with Large Language Models | Code | 1 |
| Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach | — | 0 |
| MemeCap: A Dataset for Captioning and Interpreting Memes | Code | 1 |
| i-Code Studio: A Configurable and Composable Framework for Integrative AI | — | 0 |
| DUBLIN -- Document Understanding By Language-Image Network | — | 0 |
| Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamics Audio-Visual Scenarios | Code | 0 |
| VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models | Code | 1 |
| What Makes for Good Visual Tokenizers for Large Language Models? | Code | 1 |
| Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | Code | 1 |
| MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts | Code | 1 |
| Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature | — | 0 |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | Code | 1 |
| IMAD: IMage-Augmented multi-modal Dialogue | Code | 0 |
| An Empirical Study on the Language Modal in Visual Question Answering | — | 0 |
| Probing the Role of Positional Information in Vision-Language Models | — | 0 |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Code | 1 |
| Semantic Composition in Visually Grounded Language Models | — | 0 |
| OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models | Code | 2 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Code | 2 |
| Combo of Thinking and Observing for Outside-Knowledge VQA | Code | 1 |
| Vision-Language Models in Remote Sensing: Current Progress and Future Trends | Code | 1 |
| OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese | Code | 0 |
| Adaptive loose optimization for robust question answering | Code | 0 |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Code | 4 |
| Analysis of Visual Question Answering Algorithms with attention model | — | 0 |
| Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime | — | 0 |
| CHIC: Corporate Document for Visual question Answering | — | 0 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Code | 5 |
| Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining | Code | 1 |
| A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering | Code | 1 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Code | 7 |
| SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery | Code | 1 |
| Learning Situation Hyper-Graphs for Video Question Answering | Code | 1 |
| Visual Instruction Tuning | Code | 6 |
Page 24 of 44

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | — | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | — | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | — | Unverified |
| 4 | GPT-4o + text rationale + IoT | GPT-4 score | 72.2 | — | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | — | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | — | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | — | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | — | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | — | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | — | Unverified |