SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 301-350 of 2177 papers

Title | Status | Hype
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | Code | 1
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Code | 1
Attention in Reasoning: Dataset, Analysis, and Modeling | Code | 1
M3-Jepa: Multimodal Alignment via Multi-directional MoE based on the JEPA framework | Code | 1
Attention-Based Context Aware Reasoning for Situation Recognition | Code | 1
Combo of Thinking and Observing for Outside-Knowledge VQA | Code | 1
Linearly Mapping from Image to Text Space | Code | 1
A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering | Code | 1
Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs | Code | 1
LIME: Less Is More for MLLM Evaluation | Code | 1
EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Code | 1
Learning to Discretely Compose Reasoning Module Networks for Video Captioning | Code | 1
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Code | 1
CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1
A Survey on Efficient Vision-Language Models | Code | 1
Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering | Code | 1
Learning Trimodal Relation for AVQA with Missing Modality | Code | 1
Dynamic Language Binding in Relational Visual Reasoning | Code | 1
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning | Code | 1
Learning Situation Hyper-Graphs for Video Question Answering | Code | 1
LaTr: Layout-Aware Transformer for Scene-Text VQA | Code | 1
Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention | Code | 1
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models | Code | 1
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | Code | 1
LIVE: Learnable In-Context Vector for Visual Question Answering | Code | 1
Learning to Answer Questions in Dynamic Audio-Visual Scenarios | Code | 1
Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning | Code | 1
Dual-Key Multimodal Backdoors for Visual Question Answering | Code | 1
Coarse-to-Fine Reasoning for Visual Question Answering | Code | 1
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1
A Survey of Medical Vision-and-Language Applications and Their Techniques | Code | 1
LaPA: Latent Prompt Assist Model For Medical Visual Question Answering | Code | 1
CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery | Code | 1
COBRA: Contrastive Bi-Modal Representation Algorithm | Code | 1
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | Code | 1
Language Repository for Long Video Understanding | Code | 1
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1
A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration | Code | 1
A Survey on Interpretable Cross-modal Reasoning | Code | 1
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment | Code | 1
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA | Code | 1
CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes | Code | 1
Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Code | 1
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs | Code | 1
Learning to Answer Visual Questions from Web Videos | Code | 1
EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Code | 1
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Code | 1
Don't Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases | Code | 1
Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding | Code | 1
CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | Code | 1
Page 7 of 44

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | - | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | - | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | - | Unverified
4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | - | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | - | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | - | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | - | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | - | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | - | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | - | Unverified