SOTAVerified

Benchmarking

Papers

Showing 2671–2680 of 5548 papers

Title | Status | Hype
Guidelines for Fine-grained Sentence-level Arabic Readability Annotation | - | 0
Can we hop in general? A discussion of benchmark selection and design using the Hopper environment | - | 0
Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example | - | 0
∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | - | 0
Enterprise Benchmarks for Large Language Model Evaluation | Code | 0
A Comparative Analysis on Ethical Benchmarking in Large Language Models | - | 0
Identifying Money Laundering Subgraphs on the Blockchain | Code | 0
Audio Explanation Synthesis with Generative Foundation Models | Code | 0
TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations | Code | 0
Advocating Character Error Rate for Multilingual ASR Evaluation | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified