SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 301–350 of 2177 papers

Title | Status | Hype
Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective | Code | 0
Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy | - | 0
FFA Sora, video generation as fundus fluorescein angiography simulator | - | 0
Prompting Large Language Models with Rationale Heuristics for Knowledge-based Visual Question Answering | - | 0
SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization | Code | 0
NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization | Code | 0
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization | Code | 1
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Code | 2
FedPIA -- Permuting and Integrating Adapters leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning | - | 0
Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models | Code | 0
Consistency of Compositional Generalization across Multiple Levels | Code | 0
MedCoT: Medical Chain of Thought via Hierarchical Expert | Code | 1
A Concept-Centric Approach to Multi-Modality Learning | - | 0
MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants | Code | 1
Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues | Code | 0
LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering | Code | 0
CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology | - | 0
Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track | - | 0
Patch-level Sounding Object Tracking for Audio-Visual Question Answering | - | 0
Damage Assessment after Natural Disasters with UAVs: Semantic Feature Extraction using Deep Learning | - | 0
VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation | - | 0
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Code | 9
ViUniT: Visual Unit Tests for More Robust Visual Programming | - | 0
Doe-1: Closed-Loop Autonomous Driving with Large World Model | Code | 2
Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine | Code | 2
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Code | 3
A Multimodal Social Agent | - | 0
Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions | Code | 0
How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | - | 0
Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses | - | 0
Discrete Subgraph Sampling for Interpretable Graph based Visual Question Answering | Code | 0
Can We Generate Visual Programs Without Prompting LLMs? | - | 0
IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Code | 1
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities | Code | 2
MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Code | 0
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | Code | 1
FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering | Code | 0
Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels | - | 0
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | - | 0
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | Code | 2
RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Code | 1
Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora | - | 0
LinVT: Empower Your Image-level Large Language Model to Understand Videos | Code | 2
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | Code | 1
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Code | 0
EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation | - | 0
VisionZip: Longer is Better but Not Necessary in Vision Language Models | Code | 3
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts | - | 0
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | Code | 2
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs | Code | 1
Page 7 of 44

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | - | Unverified
2 | Qwen2-VL-72B | GPT-4 score | 74 | - | Unverified
3 | InternVL2.5-78B | GPT-4 score | 72.3 | - | Unverified
4 | GPT-4o + text rationale + IoT | GPT-4 score | 72.2 | - | Unverified
5 | Lyra-Pro | GPT-4 score | 71.4 | - | Unverified
6 | GLM-4V-Plus | GPT-4 score | 71.1 | - | Unverified
7 | Phantom-7B | GPT-4 score | 70.8 | - | Unverified
8 | InternVL2.5-38B | GPT-4 score | 68.8 | - | Unverified
9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | - | Unverified
10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | - | Unverified