Benchmarking

Papers

Showing 2671–2680 of 5548 papers

Title	Date	Tasks	Status	Hype
Guidelines for Fine-grained Sentence-level Arabic Readability Annotation	Oct 11, 2024	Benchmarking, Sentence	Unverified	0
Can we hop in general? A discussion of benchmark selection and design using the Hopper environment	Oct 11, 2024	Benchmarking, Reinforcement Learning (RL)	Unverified	0
Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example	Oct 11, 2024	Benchmarking, Code Generation	Unverified	0
∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks	Oct 11, 2024	Benchmarking, Language Modeling	Unverified	0
Enterprise Benchmarks for Large Language Model Evaluation	Oct 11, 2024	Benchmarking, Language Model Evaluation	Code Available	0
A Comparative Analysis on Ethical Benchmarking in Large Language Models	Oct 11, 2024	Benchmarking, Decision Making	Unverified	0
Identifying Money Laundering Subgraphs on the Blockchain	Oct 10, 2024	Benchmarking	Code Available	0
Audio Explanation Synthesis with Generative Foundation Models	Oct 10, 2024	Benchmarking, Decision Making	Code Available	0
TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations	Oct 10, 2024	Benchmarking, Decision Making	Code Available	0
Advocating Character Error Rate for Multilingual ASR Evaluation	Oct 9, 2024	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified