SOTAVerified

Benchmarking

Papers

Showing 2471–2480 of 5548 papers

Title | Status | Hype
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions | Code | 0
AI-generated Image Quality Assessment in Visual Communication | Code | 0
Benchmarking Large Language Models for Molecule Prediction Tasks | Code | 0
DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | Code | 0
Are Large Language Models Good at Utility Judgments? | Code | 0
DispaRisk: Auditing Fairness Through Usable Information | Code | 0
A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Code | 0
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0
GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0
Generalization and Regularization in DQN | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified