SOTAVerified

MMLU

Papers

Showing 101–150 of 340 papers

| Title | Status | Hype |
|---|---|---|
| Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning | | 0 |
| Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs | Code | 0 |
| Lizard: An Efficient Linearization Framework for Large Language Models | | 0 |
| Integrating External Tools with Large Language Models to Improve Accuracy | | 0 |
| Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate | Code | 0 |
| Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations | Code | 0 |
| Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training | | 0 |
| Multi-lingual Functional Evaluation for Large Language Models | | 0 |
| Enterprise Large Language Model Evaluation Benchmark | | 0 |
| Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content | | 0 |
| Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training | | 0 |
| Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models | | 0 |
| Slimming Down LLMs Without Losing Their Minds | | 0 |
| MoE-GPS: Guidlines for Prediction Strategy for Dynamic Expert Duplication in MoE Load Balancing | | 0 |
| From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | | 0 |
| Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers | | 0 |
| GEM: Empowering LLM for both Embedding Generation and Language Understanding | | 0 |
| Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs | Code | 0 |
| Model Unlearning via Sparse Autoencoder Subspace Guided Projections | | 0 |
| Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation | Code | 0 |
| Revisiting Uncertainty Estimation and Calibration of Large Language Models | | 0 |
| DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | Code | 0 |
| Actor-Critic based Online Data Mixing For Language Model Pre-Training | | 0 |
| Large Language Models Often Know When They Are Being Evaluated | | 0 |
| Capability-Based Scaling Laws for LLM Red-Teaming | Code | 0 |
| Interleaved Reasoning for Large Language Models via Reinforcement Learning | | 0 |
| The Price of Format: Diversity Collapse in LLMs | Code | 0 |
| Efficient Data Selection at Scale via Influence Distillation | | 0 |
| BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | Code | 0 |
| B-score: Detecting biases in large language models using response history | | 0 |
| LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning | Code | 0 |
| INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling | | 0 |
| Cost-aware LLM-based Online Dataset Annotation | | 0 |
| Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning | | 0 |
| Dual Decomposition of Weights and Singular Value Low Rank Adaptation | | 0 |
| Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst | | 0 |
| Void in Language Models | Code | 0 |
| Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings | Code | 0 |
| Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation | | 0 |
| Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models | Code | 0 |
| Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning | | 0 |
| KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | | 0 |
| AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection | | 0 |
| SEM: Reinforcement Learning for Search-Efficient Large Language Models | | 0 |
| A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets | | 0 |
| Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2 | | 0 |
| LLMs Outperform Experts on Challenging Biology Benchmarks | | 0 |
| Measuring Hong Kong Massive Multi-Task Language Understanding | | 0 |
| Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients | | 0 |
| LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | | 0 |

Benchmark Results

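| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | go ahead, make my data | Final_score | 61.72 | | Unverified |
| 2 | #GreedyCow | Final_score | 61.63 | | Unverified |
| 3 | Don't Ask Us y | Final_score | 61.4 | | Unverified |
| 4 | Data_and_Confused | Final_score | 60.96 | | Unverified |
| 5 | Waffles | Final_score | 60.91 | | Unverified |
| 6 | raaka | Final_score | 60.91 | | Unverified |
| 7 | Team Procrustination | Final_score | 60.64 | | Unverified |
| 8 | Axiom Consulting Partners | Final_score | 60.63 | | Unverified |
| 9 | Lets_Be_Fair | Final_score | 60.23 | | Unverified |
| 10 | gooners | Final_score | 60.22 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Orange-mini | 0-shot MRR | 99.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | HybridBeam+ | SI-SDRi | 13.3 | | Unverified |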