SOTAVerified

Benchmarking

Papers

Showing 431-440 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | Code | 1 |
| SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | Code | 1 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1 |
| Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions | Code | 1 |
| Semantic Correspondence: Unified Benchmarking and a Strong Baseline | Code | 1 |
| Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | Code | 1 |
| FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | Code | 1 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | Code | 1 |
| REOBench: Benchmarking Robustness of Earth Observation Foundation Models | Code | 1 |
| AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |