SOTAVerified

MMLU

Papers

Showing 201–250 of 340 papers

Title | Status | Hype
SSR: Alignment-Aware Modality Connector for Speech Language Models |  | 0
Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework |  | 0
Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning |  | 0
SuperBPE: Space Travel for Language Models |  | 0
Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models |  | 0
SUTRA: Scalable Multilingual Language Model Architecture |  | 0
Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs |  | 0
Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning |  | 0
Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models |  | 0
TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise |  | 0
The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback |  | 0
The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance |  | 0
The Claude 3 Model Family: Opus, Sonnet, Haiku |  | 0
The Poison of Alignment |  | 0
The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? |  | 0
Uncovering Latent Chain of Thought Vectors in Language Models |  | 0
Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark |  | 0
Towards Multilingual LLM Evaluation for European Languages |  | 0
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception |  | 0
Towards Uncertainty-Aware Language Agent |  | 0
Transcending Scaling Laws with 0.1% Extra Compute |  | 0
Transferable text data distillation by trajectory matching |  | 0
Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests |  | 0
Understanding Finetuning for Factual Knowledge Extraction |  | 0
Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size |  | 0
Unraveling Indirect In-Context Learning Using Influence Functions |  | 0
Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach |  | 0
Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs |  | 0
Upcycling Large Language Models into Mixture of Experts |  | 0
Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content |  | 0
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination |  | 0
BrainTransformers: SNN-LLM |  | 0
B-score: Detecting biases in large language models using response history |  | 0
ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers |  | 0
Changing Answer Order Can Decrease MMLU Accuracy |  | 0
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation |  | 0
Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning |  | 0
Continuous Approximations for Improving Quantization Aware Training of LLMs |  | 0
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks |  | 0
Cost-aware LLM-based Online Dataset Annotation |  | 0
Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning |  | 0
Cost-Saving LLM Cascades with Early Abstention |  | 0
CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks |  | 0
Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation |  | 0
Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting |  | 0
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection |  | 0
GenBFA: An Evolutionary Optimization Approach to Bit-Flip Attacks on LLMs |  | 0
Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling |  | 0
DEM: Distribution Edited Model for Training with Mixed Data Distributions |  | 0
Detecting Benchmark Contamination Through Watermarking |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | go ahead, make my data | Final_score | 61.72 |  | Unverified
2 | #GreedyCow | Final_score | 61.63 |  | Unverified
3 | Don't Ask Us y | Final_score | 61.4 |  | Unverified
4 | Data_and_Confused | Final_score | 60.96 |  | Unverified
5 | Waffles | Final_score | 60.91 |  | Unverified
6 | raaka | Final_score | 60.91 |  | Unverified
7 | Team Procrustination | Final_score | 60.64 |  | Unverified
8 | Axiom Consulting Partners | Final_score | 60.63 |  | Unverified
9 | Lets_Be_Fair | Final_score | 60.23 |  | Unverified
10 | gooners | Final_score | 60.22 |  | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Orange-mini | 0-shot MRR | 99.19 |  | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | HybridBeam+ | SI-SDRi | 13.3 |  | Unverified