SOTAVerified

MMLU

Papers

Showing 101-150 of 340 papers

Title | Status | Hype
Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation | - | 0
CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks | - | 0
An Assessment of Model-On-Model Deception | - | 0
Cost-Saving LLM Cascades with Early Abstention | - | 0
Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning | - | 0
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection | - | 0
ALLaM: Large Language Models for Arabic and English | - | 0
Cost-aware LLM-based Online Dataset Annotation | - | 0
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks | - | 0
GenBFA: An Evolutionary Optimization Approach to Bit-Flip Attacks on LLMs | - | 0
GAAPO: Genetic Algorithmic Applied to Prompt Optimization | - | 0
FRAMES: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy | - | 0
Continuous Approximations for Improving Quantization Aware Training of LLMs | - | 0
CodingTeachLLM: Empowering LLM's Coding Ability via AST Prior Knowledge | - | 0
Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training | - | 0
G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks | - | 0
GEB-1.3B: Open Lightweight Large Language Model | - | 0
GECKO: Generative Language Model for English, Code and Korean | - | 0
GEM: Empowering LLM for both Embedding Generation and Language Understanding | - | 0
Measuring Hong Kong Massive Multi-Task Language Understanding | - | 0
Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning | - | 0
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation | - | 0
Few-Shot Recalibration of Language Models | - | 0
GRIN: GRadient-INformed MoE | - | 0
Assessing the Impact of Prompting Methods on ChatGPT's Mathematical Capabilities | - | 0
AgentInstruct: Toward Generative Teaching with Agentic Flows | - | 0
Evaluation of large language models using an Indian language LGBTI+ lexicon | - | 0
Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks | - | 0
LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | - | 0
Bias Evaluation and Mitigation in Retrieval-Augmented Medical Question-Answering Systems | - | 0
Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models | - | 0
Enterprise Large Language Model Evaluation Benchmark | - | 0
A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets | - | 0
LLMs Outperform Experts on Challenging Biology Benchmarks | - | 0
Eir: Thai Medical Large Language Models | - | 0
AcademicGPT: Empowering Academic Research | - | 0
Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2 | - | 0
Uncovering Latent Chain of Thought Vectors in Language Models | - | 0
Measuring Progress on Scalable Oversight for Large Language Models | - | 0
Changing Answer Order Can Decrease MMLU Accuracy | - | 0
Efficient Model Development through Fine-tuning Transfer | - | 0
Efficiently Deploying LLMs with Controlled Risk | - | 0
LLaMA Beyond English: An Empirical Study on Language Capability Transfer | - | 0
Efficient Federated Search for Retrieval-Augmented Generation | - | 0
Efficient Data Selection at Scale via Influence Distillation | - | 0
ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers | - | 0
Effectiveness of Zero-shot-CoT in Japanese Prompts | - | 0
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | - | 0
Lizard: An Efficient Linearization Framework for Large Language Models | - | 0
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | go ahead, make my data | Final_score | 61.72 | - | Unverified
2 | #GreedyCow | Final_score | 61.63 | - | Unverified
3 | Don't Ask Us y | Final_score | 61.4 | - | Unverified
4 | Data_and_Confused | Final_score | 60.96 | - | Unverified
5 | raaka | Final_score | 60.91 | - | Unverified
6 | Waffles | Final_score | 60.91 | - | Unverified
7 | Team Procrustination | Final_score | 60.64 | - | Unverified
8 | Axiom Consulting Partners | Final_score | 60.63 | - | Unverified
9 | Lets_Be_Fair | Final_score | 60.23 | - | Unverified
10 | gooners | Final_score | 60.22 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Orange-mini | 0-shot MRR | 99.19 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | HybridBeam+ | SI-SDRi | 13.3 | - | Unverified