MMLU

Papers


Showing 301–340 of 340 papers

Title	Date	Tasks	Status
Training-Free Exponential Context Extension via Cascading KV Cache	Jun 24, 2024	Book summarization, Computational Efficiency	Code Available
Void in Language Models	May 20, 2025	MMLU, Response Generation	Code Available
DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors	May 29, 2025	MMLU, Multiple-choice	Code Available
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging	Feb 25, 2025	MMLU, Multiple-choice	Code Available
RoToR: Towards More Reliable Responses for Order-Invariant Inputs	Feb 10, 2025	Graph Question Answering, MMLU	Code Available
Inconsistencies in Masked Language Models	Dec 30, 2022	LAMBADA, MMLU	Code Available
metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models	Jul 4, 2024	ARC, GSM8K	Code Available
LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning	May 24, 2025	Computational Efficiency, MMLU	Code Available
OpenGrok: Enhancing SNS Data Processing with Distilled Knowledge and Mask-like Mechanisms	Feb 11, 2025	Knowledge Distillation, MMLU	Code Available
CHAIR -- Classifier of Hallucination as Improver	Jan 5, 2025	Hallucination, MMLU	Code Available
Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy	Jan 20, 2025	MMLU	Code Available
ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study	Dec 19, 2024	Astronomy, Domain Adaptation	Code Available
CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions	Oct 4, 2024	Instruction Following, MMLU	Code Available
BenTo: Benchmark Task Reduction with In-Context Transferability	Oct 17, 2024	In-Context Learning, MMLU	Code Available
LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity	Oct 4, 2024	Diversity, Ensemble Pruning	Code Available
Performance Law of Large Language Models	Aug 19, 2024	MMLU	Code Available
Evaluation of Large Language Models via Coupled Token Generation	Feb 3, 2025	Chatbot, Large Language Model	Code Available
Empowering Cross-lingual Abilities of Instruction-tuned Large Language Models by Translation-following demonstrations	Aug 27, 2023	Instruction Following, MMLU	Code Available
Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function	Jun 3, 2024	Diversity, MMLU	Code Available
Post-Hoc Reversal: Are We Selecting Models Prematurely?	Apr 11, 2024	Language Modelling, MMLU	Code Available
The Price of Format: Diversity Collapse in LLMs	May 25, 2025	Diversity, GSM8K	Code Available
Capability-Based Scaling Laws for LLM Red-Teaming	May 26, 2025	MMLU, Prompt Engineering	Code Available
Probing then Editing Response Personality of Large Language Models	Apr 14, 2025	MMLU	Code Available
SHA256 at SemEval-2025 Task 4: Selective Amnesia -- Constrained Unlearning for Large Language Models via Knowledge Isolation	Apr 17, 2025	Attribute, Machine Unlearning	Code Available
ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation	Jun 16, 2024	Continual Learning, GSM8K	Code Available
LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient	Feb 2, 2025	MMLU	Code Available
Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate	Jul 8, 2025	Continual Learning, Mixture-of-Experts	Code Available
Voting or Consensus? Decision-Making in Multi-Agent Debate	Feb 26, 2025	Decision Making, MMLU	Code Available
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon	Feb 11, 2025	MMLU	Code Available
QLESS: A Quantized Approach for Data Valuation and Selection in Large Language Model Fine-Tuning	Feb 3, 2025	Data Valuation, Language Modeling	Code Available
When an LLM is apprehensive about its answers -- and when its uncertainty is justified	Mar 3, 2025	Math, MMLU	Code Available
Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation	May 30, 2025	Continual Pretraining, Fairness	Code Available
Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs	May 31, 2025	MMLU	Code Available
EmPO: Emotion Grounding for Empathetic Response Generation through Preference Optimization	Jun 27, 2024	Diversity, Empathetic Response Generation	Code Available
Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations	Jul 7, 2025	Attribute, MMLU	Code Available
Input Conditioned Graph Generation for Language Agents	Jun 17, 2024	Graph Generation, MMLU	Code Available
TODO: Enhancing LLM Alignment with Ternary Preferences	Nov 2, 2024	ARC, MMLU	Code Available
Effective Skill Unlearning through Intervention and Abstention	Mar 27, 2025	General Knowledge, Math	Code Available
Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning	Oct 14, 2024	In-Context Learning, MMLU	Code Available
Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs	Jul 15, 2025	Diversity, MMLU	Code Available


All datasets SIOP 2020/2021 MMLU-Pro VCTK

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	go ahead, make my data	Final_score	61.72	—	Unverified
2	#GreedyCow	Final_score	61.63	—	Unverified
3	Don't Ask Us y	Final_score	61.4	—	Unverified
4	Data_and_Confused	Final_score	60.96	—	Unverified
5	Waffles	Final_score	60.91	—	Unverified
6	raaka	Final_score	60.91	—	Unverified
7	Team Procrustination	Final_score	60.64	—	Unverified
8	Axiom Consulting Partners	Final_score	60.63	—	Unverified
9	Lets_Be_Fair	Final_score	60.23	—	Unverified
10	gooners	Final_score	60.22	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Orange-mini	0-shot MRR	99.19	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	HybridBeam+	SI-SDRi	13.3	—	Unverified