SOTAVerified

Benchmarking

Papers

Showing 2831-2840 of 5548 papers

Title | Status | Hype
BLESS: Benchmarking Large Language Models on Sentence Simplification | Code | 0
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic | | 0
DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design | Code | 0
XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series Classification | Code | 0
A Quantitative Evaluation of Dense 3D Reconstruction of Sinus Anatomy from Monocular Endoscopic Video | | 0
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0
Fast hyperboloid decision tree algorithms | Code | 1
Benchmarking and Improving Text-to-SQL Generation under Ambiguity | Code | 0
Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified