SOTAVerified

Benchmarking

Papers

Showing 791-800 of 5548 papers

Title | Status | Hype
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations | Code | 1
Benchmarking Adversarial Patch Against Aerial Detection | Code | 1
Benchmarking Data Science Agents | Code | 1
FELM: Benchmarking Factuality Evaluation of Large Language Models | Code | 1
CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Code | 1
Benchmarking Adversarial Robustness on Image Classification | Code | 1
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods | Code | 1
FineSurE: Fine-grained Summarization Evaluation using LLMs | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified