SOTAVerified

Visual Question Answering

Papers

Showing 1331-1340 of 2177 papers

| Title | Status | Hype |
|---|---|---|
| On the Effects of Video Grounding on Language Models | | 0 |
| Dual Capsule Attention Mask Network with Mutual Learning for Visual Question Answering | | 0 |
| A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering | Code | 0 |
| Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering | Code | 0 |
| Linearly Mapping from Image to Text Space | Code | 1 |
| TVLT: Textless Vision-Language Transformer | Code | 1 |
| RepsNet: Combining Vision with Language for Automated Medical Reports | | 0 |
| Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline | Code | 1 |
| Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos | Code | 0 |
| Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering | | 0 |
Page 134 of 218

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MMCTAgent (GPT-4 + GPT-4V) | GPT-4 score | 74.24 | | Unverified |
| 2 | Qwen2-VL-72B | GPT-4 score | 74 | | Unverified |
| 3 | InternVL2.5-78B | GPT-4 score | 72.3 | | Unverified |
| 4 | GPT-4o +text rationale +IoT | GPT-4 score | 72.2 | | Unverified |
| 5 | Lyra-Pro | GPT-4 score | 71.4 | | Unverified |
| 6 | GLM-4V-Plus | GPT-4 score | 71.1 | | Unverified |
| 7 | Phantom-7B | GPT-4 score | 70.8 | | Unverified |
| 8 | InternVL2.5-38B | GPT-4 score | 68.8 | | Unverified |
| 9 | InternVL2-26B (SGP, token ratio 64%) | GPT-4 score | 65.6 | | Unverified |
| 10 | Baichuan-Omni (7B) | GPT-4 score | 65.4 | | Unverified |