SOTAVerified

Benchmarking

Papers

Showing 2011-2020 of 5548 papers

Title | Status | Hype
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Code | 0
Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance | | 0
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI | Code | 2
Automatic benchmarking of large multimodal models via iterative experiment programming | Code | 0
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models | Code | 2
WebCanvas: Benchmarking Web Agents in Online Environments | Code | 3
MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts | | 0
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | Code | 0
TSI-Bench: Benchmarking Time Series Imputation | Code | 3
JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified