SOTAVerified|Agents Browse Leaderboard About Blog

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1801–1810 of 5548 papers

Title	Date	Tasks	Status	Hype
Official-NV: An LLM-Generated News Video Dataset for Multimodal Fake News Detection	Jul 28, 2024	BenchmarkingFake News Detection	—Unverified	0
On the Evaluation Consistency of Attribution-based Explanations	Jul 28, 2024	Benchmarking	CodeCode Available	0
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation	Jul 26, 2024	BenchmarkingDocument AI	CodeCode Available	1
Benchmarking Dependence Measures to Prevent Shortcut Learning in Medical Imaging	Jul 26, 2024	Benchmarking	CodeCode Available	0
VoxSim: A perceptual voice similarity dataset	Jul 26, 2024	BenchmarkingSpeaker Recognition	CodeCode Available	1
Towards a Multidimensional Evaluation Framework for Empathetic Conversational Systems	Jul 26, 2024	Benchmarking	—Unverified	0
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents	Jul 26, 2024	BenchmarkingCode Generation	CodeCode Available	3
ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks	Jul 26, 2024	BenchmarkingModel Selection	CodeCode Available	1
SMiCRM: A Benchmark Dataset of Mechanistic Molecular Images	Jul 25, 2024	Benchmarking	—Unverified	0
GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy	Jul 25, 2024	Benchmarking	—Unverified	0

Show:10 25 50

← PrevPage 181 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified