SOTAVerified|Agents Browse Leaderboard About

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 791–800 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations	Mar 21, 2024	BenchmarkingMemorization	CodeCode Available	1
Benchmarking Adversarial Patch Against Aerial Detection	Oct 30, 2022	Benchmarking	CodeCode Available	1
Benchmarking Data Science Agents	Feb 27, 2024	BenchmarkingCode Generation	CodeCode Available	1
FELM: Benchmarking Factuality Evaluation of Large Language Models	Oct 1, 2023	BenchmarkingMath	CodeCode Available	1
CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling	Jan 21, 2024	Benchmarking	CodeCode Available	1
Benchmarking Adversarial Robustness on Image Classification	Jun 1, 2020	Adversarial AttackAdversarial Robustness	CodeCode Available	1
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods	Aug 2, 2022	BenchmarkingCausal Discovery	CodeCode Available	1
FineSurE: Fine-grained Summarization Evaluation using LLMs	Jul 1, 2024	BenchmarkingHallucination	CodeCode Available	1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization	Apr 6, 2025	BenchmarkingCombinatorial Optimization	CodeCode Available	1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control	Jun 5, 2024	Benchmarkingenergy management	CodeCode Available	1

Show:10 25 50

← PrevPage 80 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified