SOTAVerified

Benchmarking

Papers

Showing 3031–3040 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | Code | 0 |
| Automatic benchmarking of large multimodal models via iterative experiment programming | Code | 0 |
| UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Code | 0 |
| The Liouville Generator for Producing Integrable Expressions | | 0 |
| JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models | | 0 |
| InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States | | 0 |
| GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Code | 0 |
| Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading | | 0 |
| Benchmarking of LLM Detection: Comparing Two Competing Approaches | | 0 |
| Standardizing Structural Causal Models | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |