MMLU

Papers

Showing 251–260 of 340 papers

Title	Date	Tasks	Status	Hype
Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs	Dec 31, 2024	Conformal Prediction, Decision Making	Unverified	0
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment	Apr 3, 2025	ARC, HellaSwag	Unverified	0
Multi-lingual Functional Evaluation for Large Language Models	Jun 25, 2025	Belebele, Instruction Following	Unverified	0
Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models	Dec 15, 2024	MMLU, Quantization	Unverified	0
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset	Dec 3, 2024	ARC, MMLU	Unverified	0
Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design	Jul 23, 2024	Formal Logic, Language Modelling	Unverified	0
None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering	Mar 3, 2025	Business Ethics, Ethics	Unverified	0
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks	Feb 18, 2025	Math, Memorization	Unverified	0
NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning	Mar 30, 2024	Language Modeling, Language Modelling	Unverified	0
Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models	Feb 20, 2025	HellaSwag, Memorization	Unverified	0

All datasets SIOP 2020/2021 MMLU-Pro VCTK

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	go ahead, make my data	Final_score	61.72	—	Unverified
2	#GreedyCow	Final_score	61.63	—	Unverified
3	Don't Ask Us y	Final_score	61.4	—	Unverified
4	Data_and_Confused	Final_score	60.96	—	Unverified
5	Waffles	Final_score	60.91	—	Unverified
6	raaka	Final_score	60.91	—	Unverified
7	Team Procrustination	Final_score	60.64	—	Unverified
8	Axiom Consulting Partners	Final_score	60.63	—	Unverified
9	Lets_Be_Fair	Final_score	60.23	—	Unverified
10	gooners	Final_score	60.22	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Orange-mini	0-shot MRR	99.19	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	HybridBeam+	SI-SDRi	13.3	—	Unverified