| RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content | Jun 17, 2024 | Benchmarking, General Knowledge | Code Available | 0 |
| Standardizing Structural Causal Models | Jun 17, 2024 | Benchmarking, Causal Inference | Code Available | 0 |
| Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex | Jun 16, 2024 | Benchmarking, Object Recognition | Unverified | 0 |
| RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models | Jun 16, 2024 | Benchmarking | Code Available | 0 |
| NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics | Jun 16, 2024 | Benchmarking, De Novo Peptide Sequencing | Unverified | 0 |
| Evaluating the Performance of Large Language Models via Debates | Jun 16, 2024 | Benchmarking | Unverified | 0 |
| GANmut: Generating and Modifying Facial Expressions | Jun 16, 2024 | Benchmarking, Diversity | Unverified | 0 |
| WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | Jun 16, 2024 | Benchmarking, Spatial Reasoning | Unverified | 0 |
| RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models | Jun 16, 2024 | Adversarial Attack, Benchmarking | Code Available | 2 |
| Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning | Jun 16, 2024 | Benchmarking, Math | Unverified | 0 |