SOTAVerified|Agents Browse Leaderboard About

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1201–1210 of 5548 papers

Title	Date	Tasks	Status	Hype
Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond	Jun 16, 2023	BenchmarkingEvidence Selection	CodeCode Available	1
Explainable Global Wildfire Prediction Models using Graph Neural Networks	Feb 11, 2024	BenchmarkingCommunity Detection	CodeCode Available	1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking	Jul 3, 2024	BenchmarkingObject	CodeCode Available	1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT	Apr 3, 2024	BenchmarkingGeneral Knowledge	CodeCode Available	1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs	Nov 2, 2020	Benchmarking	CodeCode Available	1
EXPObench: Benchmarking Surrogate-based Optimisation Algorithms on Expensive Black-box Functions	Jun 8, 2021	Bayesian OptimisationBenchmarking	CodeCode Available	1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics	May 6, 2025	Benchmarking	CodeCode Available	1
Benchmarking Large Language Models for Automated Verilog RTL Code Generation	Dec 13, 2022	BenchmarkingCode Generation	CodeCode Available	1
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms	Aug 25, 2017	BenchmarkingBIG-bench Machine Learning	CodeCode Available	1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents	Feb 27, 2025	Benchmarking	CodeCode Available	1

Show:10 25 50

← PrevPage 121 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified