MMLU

Papers

Showing 1–50 of 340 papers

Title	Date	Tasks	Status	Hype	Score
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools	Jun 18, 2024	All, GSM8K	Code Available	14	5
Qwen2 Technical Report	Jul 15, 2024	Arithmetic Reasoning, GSM8K	Code Available	13	5
SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models	Feb 28, 2025	MMLU	Code Available	11	5
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning	Mar 26, 2024	GPU, GSM8K	Code Available	9	5
Yi: Open Foundation Models by 01.AI	Mar 7, 2024	Attribute, Chatbot	Code Available	9	5
Efficient multi-prompt evaluation of LLMs	May 27, 2024	MMLU	Code Available	7	5
Qwen2.5-Omni Technical Report	Mar 26, 2025	Automatic Speech Recognition (ASR), GSM8K	Code Available	7	5
DataComp-LM: In search of the next generation of training sets for language models	Jun 17, 2024	Language Modelling, MMLU	Code Available	7	5
Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training	May 23, 2024	GSM8K, Mixture-of-Experts	Code Available	7	5
ART: Automatic multi-step reasoning and tool-use for large language models	Mar 16, 2023	MMLU	Code Available	6	5
Training Compute-Optimal Large Language Models	Mar 29, 2022	Anachronisms, Analogical Similarity	Code Available	6	5
Make Your LLM Fully Utilize the Context	Apr 25, 2024	4k, Information Retrieval	Code Available	5	5
Baichuan 2: Open Large-scale Language Models	Sep 19, 2023	Feature Engineering, GSM8K	Code Available	4	5
BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text	Mar 27, 2024	Articles, Language Modeling	Code Available	4	5
Galactica: A Large Language Model for Science	Nov 16, 2022	Anachronisms, Bias Detection	Code Available	4	5
Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions	Aug 1, 2024	Medical Question Answering, MedQA	Code Available	4	5
YourBench: Easy Custom Evaluation Sets for Everyone	Apr 2, 2025	MMLU	Code Available	3	5
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory	Apr 10, 2025	Math, MMLU	Code Available	3	5
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding	Apr 25, 2024	GSM8K, HellaSwag	Code Available	3	5
ChatMusician: Understanding and Generating Music Intrinsically with LLM	Feb 25, 2024	MMLU, Text Generation	Code Available	3	5
General-Reasoner: Advancing LLM Reasoning Across All Domains	May 20, 2025	All, Math	Code Available	3	5
Are We Done with MMLU?	Jun 6, 2024	MMLU, Virology	Code Available	3	5
DataDecide: How to Predict Best Pretraining Data with Small Experiments	Apr 15, 2025	ARC, HellaSwag	Code Available	3	5
ReasonIR: Training Retrievers for Reasoning Tasks	Apr 29, 2025	Information Retrieval, MMLU	Code Available	3	5
REPLUG: Retrieval-Augmented Black-Box Language Models	Jan 30, 2023	Language Modeling, Language Modelling	Code Available	3	5
HadaCore: Tensor Core Accelerated Hadamard Transform Kernel	Dec 12, 2024	GPU, MMLU	Code Available	3	5
LoLCATs: On Low-Rank Linearizing of Large Language Models	Oct 14, 2024	MMLU	Code Available	3	5
Scaling Instruction-Finetuned Language Models	Oct 20, 2022	Coreference Resolution, Cross-Lingual Question Answering	Code Available	3	5
Compact Language Models via Pruning and Knowledge Distillation	Jul 19, 2024	Knowledge Distillation, Language Modeling	Code Available	3	5
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark	Jun 3, 2024	MMLU, Multi-task Language Understanding	Code Available	3	5
What Matters in Transformers? Not All Attention is Needed	Jun 22, 2024	All, MMLU	Code Available	2	5
Accurate LoRA-Finetuning Quantization of LLMs via Information Retention	Feb 8, 2024	MMLU, Quantization	Code Available	2	5
A StrongREJECT for Empty Jailbreaks	Feb 15, 2024	MMLU	Code Available	2	5
SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents	Mar 13, 2024	Language Modeling, Language Modelling	Code Available	2	5
Routoo: Learning to Route to Large Language Models Effectively	Jan 25, 2024	MMLU, Multi-task Language Understanding	Code Available	2	5
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs	Apr 21, 2024	MMLU, Red Teaming	Code Available	2	5
tinyBenchmarks: evaluating LLMs with fewer examples	Feb 22, 2024	MMLU, Multiple-choice	Code Available	2	5
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples	Nov 8, 2023	HumanEval, MMLU	Code Available	2	5
Reinforcing General Reasoning without Verifiers	May 27, 2025	Math, Mathematical Reasoning	Code Available	2	5
Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization	Apr 8, 2025	Math, Mathematical Reasoning	Code Available	2	5
MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark	Dec 19, 2024	MMLU, Multiple-choice	Code Available	2	5
Atlas: Few-shot Learning with Retrieval Augmented Language Models	Aug 5, 2022	Fact Checking, Few-Shot Learning	Code Available	2	5
any4: Learned 4-bit Numeric Representation for LLMs	Jul 7, 2025	GPU, GSM8K	Code Available	2	5
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning	Nov 16, 2023	MedQA, MMLU	Code Available	2	5
Inheritune: Training Smaller Yet More Attentive Language Models	Apr 12, 2024	Decoder, Language Modelling	Code Available	2	5
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models	Dec 11, 2023	Benchmarking, Emotional Intelligence	Code Available	2	5
Aurora: Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning	Dec 22, 2023	Instruction Following, Mixture-of-Experts	Code Available	2	5
Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models	Mar 28, 2025	MMLU, Quantization	Code Available	2	5
Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In	May 27, 2023	MMLU, Retrieval	Code Available	1	5
Efficient Online Data Mixing For Language Model Pre-Training	Dec 5, 2023	Language Modeling, Language Modelling	Code Available	1	5

Datasets: SIOP 2020/2021, MMLU-Pro, VCTK

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	go ahead, make my data	Final_score	61.72	—	Unverified
2	#GreedyCow	Final_score	61.63	—	Unverified
3	Don't Ask Us y	Final_score	61.4	—	Unverified
4	Data_and_Confused	Final_score	60.96	—	Unverified
5	Waffles	Final_score	60.91	—	Unverified
6	raaka	Final_score	60.91	—	Unverified
7	Team Procrustination	Final_score	60.64	—	Unverified
8	Axiom Consulting Partners	Final_score	60.63	—	Unverified
9	Lets_Be_Fair	Final_score	60.23	—	Unverified
10	gooners	Final_score	60.22	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Orange-mini	0-shot MRR	99.19	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	HybridBeam+	SI-SDRi	13.3	—	Unverified