Benchmarking

Papers

Showing 3031–3040 of 5548 papers

Title	Date	Tasks	Status	Hype
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning	Jun 18, 2024	Benchmarking, World Knowledge	Code Available	0
Automatic benchmarking of large multimodal models via iterative experiment programming	Jun 18, 2024	Benchmarking, Language Modeling	Code Available	0
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions	Jun 18, 2024	Benchmarking, Multiple-choice	Code Available	0
The Liouville Generator for Producing Integrable Expressions	Jun 17, 2024	Benchmarking	Unverified	0
JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models	Jun 17, 2024	Benchmarking, counterfactual	Unverified	0
InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States	Jun 17, 2024	Benchmarking, Contrastive Learning	Unverified	0
GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations	Jun 17, 2024	Benchmarking, Dataset Generation	Code Available	0
Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading	Jun 17, 2024	Autonomous Vehicles, Benchmarking	Unverified	0
Benchmarking of LLM Detection: Comparing Two Competing Approaches	Jun 17, 2024	Benchmarking	Unverified	0
Standardizing Structural Causal Models	Jun 17, 2024	Benchmarking, Causal Inference	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified