SOTAVerified

MMLU

Papers

Showing 301-340 of 340 papers

| Title | Status | Hype |
| --- | --- | --- |
| LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama | | 0 |
| Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | | 0 |
| Large Language Models Could Be Rote Learners | | 0 |
| Large Language Models Often Know When They Are Being Evaluated | | 0 |
| Learning from "Silly" Questions Improves Large Language Models, But Only Slightly | | 0 |
| Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning | | 0 |
| Let's Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning | | 0 |
| Leveraging Approximate Caching for Faster Retrieval-Augmented Generation | | 0 |
| Leveraging Uncertainty Estimation for Efficient LLM Routing | | 0 |
| Lizard: An Efficient Linearization Framework for Large Language Models | | 0 |
| Llama 3 Meets MoE: Efficient Upcycling | | 0 |
| LLaMA Beyond English: An Empirical Study on Language Capability Transfer | | 0 |
| LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction | | 0 |
| Large Language Model Compression with Neural Architecture Search | | 0 |
| LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering | | 0 |
| LLMs Outperform Experts on Challenging Biology Benchmarks | | 0 |
| LM-Cocktail: Resilient Tuning of Language Models via Model Merging | | 0 |
| LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | | 0 |
| An Empirical Study of Mamba-based Language Models | | 0 |
| Measuring Hong Kong Massive Multi-Task Language Understanding | | 0 |
| Measuring Progress on Scalable Oversight for Large Language Models | | 0 |
| Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients | | 0 |
| MIND: Math Informed syNthetic Dialogues for Pretraining LLMs | | 0 |
| Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning | | 0 |
| MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures | | 0 |
| MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design | | 0 |
| Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference | | 0 |
| Watson: A Cognitive Observability Framework for the Reasoning of LLM-Powered Agents | | 0 |
| An Assessment of Model-On-Model Deception | | 0 |
| Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers | | 0 |
| Model Unlearning via Sparse Autoencoder Subspace Guided Projections | | 0 |
| MoE-GPS: Guidlines for Prediction Strategy for Dynamic Expert Duplication in MoE Load Balancing | | 0 |
| Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs | | 0 |
| More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | | 0 |
| Multi-lingual Functional Evaluation for Large Language Models | | 0 |
| Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models | | 0 |
| Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset | | 0 |
| Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design | | 0 |
| None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering | | 0 |
| None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | | 0 |

Benchmark Results

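| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | go ahead, make my data | Final_score | 61.72 | | Unverified |
| 2 | #GreedyCow | Final_score | 61.63 | | Unverified |
| 3 | Don't Ask Us y | Final_score | 61.4 | | Unverified |
| 4 | Data_and_Confused | Final_score | 60.96 | | Unverified |
| 5 | Waffles | Final_score | 60.91 | | Unverified |
| 6 | raaka | Final_score | 60.91 | | Unverified |
| 7 | Team Procrustination | Final_score | 60.64 | | Unverified |
| 8 | Axiom Consulting Partners | Final_score | 60.63 | | Unverified |
| 9 | Lets_Be_Fair | Final_score | 60.23 | | Unverified |
| 10 | gooners | Final_score | 60.22 | | Unverified |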

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Orange-mini | 0-shot MRR | 99.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | HybridBeam+ | SI-SDRi | 13.3 | | Unverified |