SOTAVerified

MMR total

The sum of scores across the 11 distinct tasks of the Multi-Modal Reading (MMR) Benchmark, which cover text, fonts, visual elements, bounding boxes, spatial relations, and grounding.
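As a minimal sketch of how this aggregate works: the MMR total is simply the sum of the per-task scores over the benchmark's 11 tasks. The task labels and scores below are hypothetical placeholders, not actual MMR results.

```python
# Sketch of the MMR total: the benchmark score is the sum of the per-task
# scores across its 11 tasks. Labels and values are hypothetical.
task_scores = {
    f"task_{i}": s
    for i, s in enumerate([41, 38, 45, 40, 37, 44, 39, 42, 36, 43, 40], start=1)
}

assert len(task_scores) == 11  # MMR defines 11 distinct tasks

mmr_total = sum(task_scores.values())
print(mmr_total)
```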

Papers

Showing 1–10 of 12 papers

| Title | Status | Hype |
| --- | --- | --- |
| Visual Instruction Tuning | Code | 6 |
| GPT-4 Technical Report | Code | 6 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Code | 5 |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Code | 3 |
| OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | Code | 2 |
| The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) | Code | 1 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Code | 1 |
| What matters when building vision-language models? | | 0 |
| GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding | | 0 |
| MMR: Evaluating Reading Ability of Large Multimodal Models | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude 3.5 Sonnet | Total Column Score | 463 | | Unverified |
| 2 | GPT-4o | Total Column Score | 457 | | Unverified |
| 3 | GPT-4V | Total Column Score | 415 | | Unverified |
| 4 | LLaVA-NEXT-34B | Total Column Score | 412 | | Unverified |
| 5 | Phi-3-Vision | Total Column Score | 397 | | Unverified |
| 6 | InternVL2-8B | Total Column Score | 368 | | Unverified |
| 7 | Qwen-vl-max | Total Column Score | 366 | | Unverified |
| 8 | LLaVA-NEXT-13B | Total Column Score | 335 | | Unverified |
| 9 | Qwen-vl-plus | Total Column Score | 310 | | Unverified |
| 10 | Idefics-2-8B | Total Column Score | 256 | | Unverified |