Benchmarking

Papers

Showing 2011–2020 of 5548 papers

Title	Date	Tasks	Status	Hype
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions	Jun 18, 2024	Benchmarking, Multiple-choice	Code Available	0
Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance	Jun 18, 2024	Benchmarking	Unverified	0
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI	Jun 18, 2024	Benchmarking, scientific discovery	Code Available	2
Automatic benchmarking of large multimodal models via iterative experiment programming	Jun 18, 2024	Benchmarking, Language Modeling	Code Available	0
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models	Jun 18, 2024	Benchmarking, Depth Estimation	Code Available	2
WebCanvas: Benchmarking Web Agents in Online Environments	Jun 18, 2024	AI Agent, Benchmarking	Code Available	3
MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts	Jun 18, 2024	Articles, Benchmarking	Unverified	0
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning	Jun 18, 2024	Benchmarking, World Knowledge	Code Available	0
TSI-Bench: Benchmarking Time Series Imputation	Jun 18, 2024	Benchmarking, Deep Learning	Code Available	3
JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models	Jun 17, 2024	Benchmarking, counterfactual	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified