SOTAVerified|Agents Browse Leaderboard About

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1161–1170 of 5548 papers

Title	Date	Tasks	Status	Hype
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data	Feb 27, 2024	Benchmarking	CodeCode Available	1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates	Jul 8, 2024	Benchmarkingknowledge editing	CodeCode Available	1
Benchmarking Large Language Models for News Summarization	Jan 31, 2023	BenchmarkingNews Summarization	CodeCode Available	1
Benchmarking Multidomain English-Indonesian Machine Translation	May 1, 2020	BenchmarkingMachine Translation	CodeCode Available	1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark	Sep 16, 2020	BenchmarkingKnowledge Graph Completion	CodeCode Available	1
Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation	Mar 7, 2024	BenchmarkingMultimodal Recommendation	CodeCode Available	1
CODEMENV: Benchmarking Large Language Models on Code Migration	Jun 1, 2025	Benchmarking	CodeCode Available	1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking	Jan 22, 2020	Benchmarkingobject-detection	CodeCode Available	1
Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations	Jul 4, 2018	Adversarial DefenseBenchmarking	CodeCode Available	1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation	Feb 26, 2025	BenchmarkingCode Generation	CodeCode Available	1

Show:10 25 50

← PrevPage 117 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified