SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 351–400 of 2177 papers

Title | Status | Hype
A Survey on Efficient Vision-Language Models | Code | 1
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Code | 1
Attention in Reasoning: Dataset, Analysis, and Modeling | Code | 1
Florence: A New Foundation Model for Computer Vision | Code | 1
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Code | 1
COBRA: Contrastive Bi-Modal Representation Algorithm | Code | 1
CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery | Code | 1
Explaining Autonomous Driving Actions with Visual Question Answering | Code | 1
A Survey of Medical Vision-and-Language Applications and Their Techniques | Code | 1
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering | Code | 1
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1
Coarse-to-Fine Reasoning for Visual Question Answering | Code | 1
GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Code | 1
Consistency-preserving Visual Question Answering in Medical Imaging | Code | 1
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models | Code | 1
ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax | Code | 1
Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering | Code | 1
Location-Free Scene Graph Generation | Code | 1
LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering | Code | 1
Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning | Code | 1
Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering | Code | 1
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | Code | 1
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts | Code | 1
Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering | Code | 1
A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration | Code | 1
Evaluating Multimodal Representations on Visual Semantic Textual Similarity | Code | 1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1
CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes | Code | 1
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Code | 1
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving | Code | 1
Localized Questions in Medical Visual Question Answering | Code | 1
MapQA: A Dataset for Question Answering on Choropleth Maps | Code | 1
Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | Code | 1
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Code | 1
CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | Code | 1
LIME: Less Is More for MLLM Evaluation | Code | 1
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator | Code | 1
Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs | Code | 1
Linearly Mapping from Image to Text Space | Code | 1
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | Code | 1
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | Code | 1
An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge | Code | 1
Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Code | 1
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering | Code | 1
EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Code | 1
Cross-modal Information Flow in Multimodal Large Language Models | Code | 1
Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering | Code | 1
BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation | Code | 1
Learning to Answer Visual Questions from Web Videos | Code | 1
Learning to Discretely Compose Reasoning Module Networks for Video Captioning | Code | 1
Page 8 of 44

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified
4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified