SOTAVerified

Benchmarking

Papers

Showing 2031–2040 of 5548 papers

Title | Status | Hype
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content | Code | 0
Standardizing Structural Causal Models | Code | 0
Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex | | 0
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models | Code | 0
NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics | | 0
Evaluating the Performance of Large Language Models via Debates | | 0
GANmut: Generating and Modifying Facial Expressions | | 0
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | | 0
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models | Code | 2
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | | 0
Page 204 of 555

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified