
MMR total

The sum of scores across the 11 distinct tasks in the Multi-Modal Reading (MMR) Benchmark, covering texts, fonts, visual elements, bounding boxes, spatial relations, and grounding.
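As a minimal sketch, the MMR total is simply the sum of a model's per-task scores. The task names and values below are illustrative assumptions, not the benchmark's actual task list or results.

```python
# Hypothetical per-task scores for one model. Task names and values are
# illustrative only -- MMR defines 11 tasks spanning texts, fonts, visual
# elements, bounding boxes, spatial relations, and grounding.
task_scores = {
    "text_recognition": 45,
    "font_recognition": 40,
    "visual_elements": 42,
    "bounding_box": 38,
    "spatial_relations": 41,
    "grounding": 44,
    # remaining tasks omitted in this sketch
}

# The MMR total is the sum over all per-task scores.
mmr_total = sum(task_scores.values())
print(mmr_total)
```

With all 11 task scores present, this total is what the "Claimed" column in the results table reports.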

Papers

Showing 12 of 12 papers

| Title | Status | Hype |
| --- | --- | --- |
| Visual Instruction Tuning | Code | 6 |
| GPT-4 Technical Report | Code | 6 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Code | 5 |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Code | 3 |
| OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | Code | 2 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Code | 1 |
| The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) | Code | 1 |
| MMR: Evaluating Reading Ability of Large Multimodal Models | — | 0 |
| Claude 3.5 Sonnet Model Card Addendum | — | 0 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | — | 0 |
| What matters when building vision-language models? | — | 0 |
| Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone | — | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude 3.5 Sonnet | Total Column Score | 463 | — | Unverified |
| 2 | GPT-4o | Total Column Score | 457 | — | Unverified |
| 3 | GPT-4V | Total Column Score | 415 | — | Unverified |
| 4 | LLaVA-NEXT-34B | Total Column Score | 412 | — | Unverified |
| 5 | Phi-3-Vision | Total Column Score | 397 | — | Unverified |
| 6 | InternVL2-8B | Total Column Score | 368 | — | Unverified |
| 7 | Qwen-vl-max | Total Column Score | 366 | — | Unverified |
| 8 | LLaVA-NEXT-13B | Total Column Score | 335 | — | Unverified |
| 9 | Qwen-vl-plus | Total Column Score | 310 | — | Unverified |
| 10 | Idefics-2-8B | Total Column Score | 256 | — | Unverified |