SOTAVerified

MMLU

Papers

Showing 101-125 of 340 papers

Title | Status | Hype
Efficient Federated Search for Retrieval-Augmented Generation | - | 0
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging | Code | 0
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks | - | 0
Distributional Scaling Laws for Emergent Capabilities | - | 0
Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks | - | 0
Detecting Benchmark Contamination Through Watermarking | - | 0
Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs | - | 0
Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models | - | 0
Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective | Code | 0
Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests | - | 0
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | - | 0
Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | - | 0
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception | - | 0
TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages | Code | 1
Leveraging Uncertainty Estimation for Efficient LLM Routing | - | 0
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning | - | 0
ORI: O Routing Intelligence | - | 0
Cost-Saving LLM Cascades with Early Abstention | - | 0
Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models | - | 0
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon | Code | 0
OpenGrok: Enhancing SNS Data Processing with Distilled Knowledge and Mask-like Mechanisms | Code | 0
RoToR: Towards More Reliable Responses for Order-Invariant Inputs | Code | 0
Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark | - | 0
LM2: Large Memory Models | Code | 1
FRAMES: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy | - | 0
Page 5 of 14

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | go ahead, make my data | Final_score | 61.72 | - | Unverified
2 | #GreedyCow | Final_score | 61.63 | - | Unverified
3 | Don't Ask Us y | Final_score | 61.4 | - | Unverified
4 | Data_and_Confused | Final_score | 60.96 | - | Unverified
5 | raaka | Final_score | 60.91 | - | Unverified
6 | Waffles | Final_score | 60.91 | - | Unverified
7 | Team Procrustination | Final_score | 60.64 | - | Unverified
8 | Axiom Consulting Partners | Final_score | 60.63 | - | Unverified
9 | Lets_Be_Fair | Final_score | 60.23 | - | Unverified
10 | gooners | Final_score | 60.22 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Orange-mini | 0-shot MRR | 99.19 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | HybridBeam+ | SI-SDRi | 13.3 | - | Unverified