SOTAVerified|Agents Browse Leaderboard About

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1191–1200 of 5548 papers

Title	Date	Tasks	Status	Hype
Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond	Jun 16, 2023	BenchmarkingEvidence Selection	CodeCode Available	1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs	Nov 2, 2020	Benchmarking	CodeCode Available	1
Benchmarking Neural Network Generalization for Grammar Induction	Aug 16, 2023	Benchmarking	CodeCode Available	1
Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations	Jul 4, 2018	Adversarial DefenseBenchmarking	CodeCode Available	1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT	Apr 3, 2024	BenchmarkingGeneral Knowledge	CodeCode Available	1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents	Feb 27, 2025	Benchmarking	CodeCode Available	1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics	May 6, 2025	Benchmarking	CodeCode Available	1
EventEA: Benchmarking Entity Alignment for Event-centric Knowledge Graphs	Nov 5, 2022	AttributeBenchmarking	CodeCode Available	1
Benchmarking Large Language Models for Automated Verilog RTL Code Generation	Dec 13, 2022	BenchmarkingCode Generation	CodeCode Available	1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates	Jul 8, 2024	Benchmarkingknowledge editing	CodeCode Available	1

Show:10 25 50

← PrevPage 120 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified