SOTAVerified

Visual Question Answering

MLLM Leaderboard

Papers

Showing 751–800 of 2177 papers

| Title | Status | Hype |
|---|---|---|
| Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields | | 0 |
| Improved Alignment of Modalities in Large Vision Language Models | | 0 |
| VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction | Code | 0 |
| LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? | | 0 |
| ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation | | 0 |
| DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels | | 0 |
| MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering | | 0 |
| Where is this coming from? Making groundedness count in the evaluation of Document VQA models | | 0 |
| Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models | | 0 |
| Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models | Code | 0 |
| Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study | Code | 0 |
| UMIT: Unifying Medical Imaging Tasks via Vision-Language Models | Code | 0 |
| A Vision Centric Remote Sensing Benchmark | | 0 |
| GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback | | 0 |
| TruthLens: A Training-Free Paradigm for DeepFake Detection | | 0 |
| UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | | 0 |
| EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models | | 0 |
| Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding | Code | 0 |
| Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference | | 0 |
| From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration | | 0 |
| GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | | 0 |
| PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models | | 0 |
| T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation | Code | 0 |
| DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models | | 0 |
| SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery | | 0 |
| On the Limitations of Vision-Language Models in Understanding Image Transforms | | 0 |
| Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework | | 0 |
| Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru | | 0 |
| From Text to Visuals: Using LLMs to Generate Math Diagrams with Vector Graphics | | 0 |
| TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems | | 0 |
| Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models | | 0 |
| MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering | | 0 |
| Treble Counterfactual VLMs: A Causal Approach to Hallucination | Code | 0 |
| SplatTalk: 3D VQA with Gaussian Splatting | | 0 |
| Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation | | 0 |
| Enhancing Vietnamese VQA through Curriculum Learning on Raw and Augmented Text Representations | Code | 0 |
| BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA | Code | 0 |
| OWLViz: An Open-World Benchmark for Visual Question Answering | | 0 |
| Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models | Code | 0 |
| FunBench: Benchmarking Fundus Reading Skills of MLLMs | | 0 |
| CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering | | 0 |
| Fine-Grained Retrieval-Augmented Generation for Visual Question Answering | | 0 |
| MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models | Code | 0 |
| Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios | | 0 |
| Talking to the brain: Using Large Language Models as Proxies to Model Brain Semantic Representation | | 0 |
| MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning | | 0 |
| Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference | Code | 0 |
| FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA | | 0 |
| All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark | | 0 |
| Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines | | 0 |
Page 16 of 44

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o + text rationale + IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |
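
Every entry above reports a "GPT-4 score", i.e. an LLM-as-judge metric in which GPT-4 grades a model's free-form answer against the ground-truth answer. The exact judging protocol behind this leaderboard is not shown on this page; the sketch below illustrates how such a score is commonly computed, assuming the official OpenAI Python client. The judge prompt wording, the 0–5 rating scale rescaled to 0–100, and the choice of `gpt-4o` as the judge are illustrative assumptions, not this leaderboard's verified procedure.

```python
# Minimal LLM-as-judge sketch for a "GPT-4 score" over VQA predictions.
# Assumptions (not taken from the leaderboard): the judge prompt, the
# 0-5 rating scale rescaled to 0-100, and gpt-4o as the judge model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a visual question answering system.
Question: {question}
Ground-truth answer: {reference}
Model answer: {prediction}
Rate the correctness of the model answer from 0 (wrong) to 5 (perfect).
Reply with the number only."""

def judge_one(question: str, reference: str, prediction: str) -> float:
    """Ask the judge model for a 0-5 correctness rating of one answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
        temperature=0,  # deterministic grading
    )
    return float(response.choices[0].message.content.strip())

def gpt4_score(samples: list[dict]) -> float:
    """Mean judge rating over all samples, rescaled to a 0-100 score."""
    ratings = [judge_one(s["question"], s["answer"], s["prediction"])
               for s in samples]
    return 100.0 * sum(ratings) / (5.0 * len(ratings))
```

Under these illustrative assumptions, a claimed score of 74.24 would correspond to a mean judge rating of roughly 3.7 out of 5; the actual scale and prompt used by each paper may differ, which is one reason these entries remain Unverified.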