SOTAVerified

Benchmarking

Papers

Showing 1161-1170 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts | Code | 0 |
| AI-generated Image Quality Assessment in Visual Communication | Code | 0 |
| Generative CKM Construction using Partially Observed Data with Diffusion Model | Code | 1 |
| TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation | Code | 1 |
| Pitfalls of topology-aware image segmentation |  | 0 |
| AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Code | 2 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1 |
| TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks | Code | 1 |
| AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge | Code | 0 |
| Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning |  | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified |