SOTAVerified

Benchmarking

Papers

Showing 3111-3120 of 5548 papers

Title | Status | Hype
ACCORD: Closing the Commonsense Measurability Gap | Code | 0
TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability | Code | 0
LanEvil: Benchmarking the Robustness of Lane Detection to Environmental Illusions | - | 0
ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection | - | 0
R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models | - | 0
Scaffold Splits Overestimate Virtual Screening Performance | - | 0
WebSuite: Systematically Evaluating Why Web Agents Fail | Code | 0
On the project risk baseline: integrating aleatory uncertainty into project scheduling | - | 0
Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images | - | 0
CoSy: Evaluating Textual Explanations of Neurons | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified