
Benchmarking

Papers


Showing 2311–2320 of 5548 papers

Title	Date	Tasks	Status	Hype
From Protoscience to Epistemic Monoculture: How Benchmarking Set the Stage for the Deep Learning Revolution	Apr 9, 2024	Benchmarking	Unverified	0
WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs	Apr 9, 2024	Benchmarking, Code Generation	Unverified	0
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents	Apr 9, 2024	Benchmarking	Code Available	1
EFSA: Towards Event-Level Financial Sentiment Analysis	Apr 8, 2024	Articles, Benchmarking	Code Available	0
MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering	Apr 8, 2024	Benchmarking, Medical Question Answering	Unverified	0
HOEG: A New Approach for Object-Centric Predictive Process Monitoring	Apr 8, 2024	Benchmarking, Graph Neural Network	Code Available	0
Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level	Apr 8, 2024	Benchmarking	Code Available	0
A Comparison of Cryptocurrency Volatility-benchmarking New and Mature Asset Classes	Apr 7, 2024	Benchmarking	Unverified	0
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models	Apr 7, 2024	Benchmarking, Knowledge Editing	Code Available	0
PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics	Apr 6, 2024	Benchmarking, Hallucination	Code Available	0



Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	Accuracy (ACC)	0.56	—	Unverified