SOTAVerified

Benchmarking

Papers

Showing 2471-2480 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| E(3)-equivariant models cannot learn chirality: Field-based molecular generation | | 0 |
| API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1 |
| Benchmarking the Robustness of Panoptic Segmentation for Automated Driving | | 0 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Code | 2 |
| Benchmarking Observational Studies with Experimental Data under Right-Censoring | | 0 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0 |
| CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1 |
| Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment | Code | 1 |
| The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning | Code | 1 |
| MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |