SOTAVerified

MMLU

Papers

Showing 251-300 of 340 papers

| Title | Status | Hype |
|---|---|---|
| Large Language Model Compression with Neural Architecture Search | | 0 |
| Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths | | 0 |
| Continuous Approximations for Improving Quantization Aware Training of LLMs | | 0 |
| CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions | Code | 0 |
| LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity | Code | 0 |
| BrainTransformers: SNN-LLM | | 0 |
| Efficiently Deploying LLMs with Controlled Risk | | 0 |
| DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining | | 0 |
| Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs | | 0 |
| Instance-adaptive Zero-shot Chain-of-Thought Prompting | | 0 |
| SSR: Alignment-Aware Modality Connector for Speech Language Models | | 0 |
| Uncovering Latent Chain of Thought Vectors in Language Models | | 0 |
| Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination | | 0 |
| GRIN: GRadient-INformed MoE | | 0 |
| Eir: Thai Medical Large Language Models | | 0 |
| CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks | | 0 |
| Selective Self-Rehearsal: A Fine-Tuning Approach to Improve Generalization in Large Language Models | | 0 |
| MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs | Code | 0 |
| Performance Law of Large Language Models | Code | 0 |
| SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models | | 0 |
| Reasoning Beyond Bias: A Study on Counterfactual Prompting and Chain of Thought Reasoning | | 0 |
| BOTS-LM: Training Large Language Models for Setswana | | 0 |
| Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design | | 0 |
| ALLaM: Large Language Models for Arabic and English | | 0 |
| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Code | 0 |
| AgentInstruct: Toward Generative Teaching with Agentic Flows | | 0 |
| Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning | | 0 |
| Changing Answer Order Can Decrease MMLU Accuracy | | 0 |
| EmPO: Emotion Grounding for Empathetic Response Generation through Preference Optimization | Code | 0 |
| Training-Free Exponential Context Extension via Cascading KV Cache | Code | 0 |
| Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling | | 0 |
| DEM: Distribution Edited Model for Training with Mixed Data Distributions | | 0 |
| Pistis-RAG: Enhancing Retrieval-Augmented Generation with Human Feedback | | 0 |
| Optimised Grouped-Query Attention Mechanism for Transformers | | 0 |
| Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Code | 0 |
| Understanding Finetuning for Factual Knowledge Extraction | | 0 |
| Input Conditioned Graph Generation for Language Agents | Code | 0 |
| The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance | | 0 |
| Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting | | 0 |
| ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation | Code | 0 |
| Reactor Mk.1 performances: MMLU, HumanEval and BBH test results | | 0 |
| MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models | | 0 |
| Quantifying Variance in Evaluation Benchmarks | | 0 |
| GEB-1.3B: Open Lightweight Large Language Model | | 0 |
| An Empirical Study of Mamba-based Language Models | | 0 |
| Does your data spark joy? Performance gains from domain upsampling at the end of training | | 0 |
| Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function | Code | 0 |
| MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures | | 0 |
| Spanish and LLM Benchmarks: is MMLU Lost in Translation? | | 0 |
| GECKO: Generative Language Model for English, Code and Korean | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | go ahead, make my data | Final_score | 61.72 | | Unverified |
| 2 | #GreedyCow | Final_score | 61.63 | | Unverified |
| 3 | Don't Ask Us y | Final_score | 61.4 | | Unverified |
| 4 | Data_and_Confused | Final_score | 60.96 | | Unverified |
| 5 | Waffles | Final_score | 60.91 | | Unverified |
| 6 | raaka | Final_score | 60.91 | | Unverified |
| 7 | Team Procrustination | Final_score | 60.64 | | Unverified |
| 8 | Axiom Consulting Partners | Final_score | 60.63 | | Unverified |
| 9 | Lets_Be_Fair | Final_score | 60.23 | | Unverified |
| 10 | gooners | Final_score | 60.22 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Orange-mini | 0-shot MRR | 99.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | HybridBeam+ | SI-SDRi | 13.3 | | Unverified |