Benchmarking

Papers

Showing 431–440 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models	May 26, 2025	Benchmarking, RAG	Code Available	1
SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning	May 25, 2025	Benchmarking, Visual Reasoning	Code Available	1
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering	May 25, 2025	Anatomy, Benchmarking	Code Available	1
Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions	May 23, 2025	2k, Benchmarking	Code Available	1
Semantic Correspondence: Unified Benchmarking and a Strong Baseline	May 23, 2025	Benchmarking, Semantic correspondence	Code Available	1
Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph	May 23, 2025	Benchmarking, Management	Code Available	1
FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow	May 23, 2025	Benchmarking, Code Generation	Code Available	1
Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering	May 22, 2025	Benchmarking, Evidence Selection	Code Available	1
REOBench: Benchmarking Robustness of Earth Observation Foundation Models	May 22, 2025	Benchmarking, Contrastive Learning	Code Available	1
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios	May 22, 2025	Benchmarking, Instruction Following	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified