SOTAVerified

MMLU

Papers

Showing 276-300 of 340 papers

| Title | Status | Hype |
| --- | --- | --- |
| Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training | | 0 |
| G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks | | 0 |
| GEB-1.3B: Open Lightweight Large Language Model | | 0 |
| GECKO: Generative Language Model for English, Code and Korean | | 0 |
| GEM: Empowering LLM for both Embedding Generation and Language Understanding | | 0 |
| A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets | | 0 |
| Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation | | 0 |
| GRIN: GRadient-INformed MoE | | 0 |
| HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | | 0 |
| Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models | | 0 |
| Humanity's Last Exam | | 0 |
| Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents | | 0 |
| IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding | | 0 |
| INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling | | 0 |
| Inference-Time-Compute: More Faithful? A Research Note | | 0 |
| Instance-adaptive Zero-shot Chain-of-Thought Prompting | | 0 |
| Instruction Tuning with Human Curriculum | | 0 |
| Integrating External Tools with Large Language Models to Improve Accuracy | | 0 |
| Interleaved Reasoning for Large Language Models via Reinforcement Learning | | 0 |
| Investigating Data Contamination in Modern Benchmarks for Large Language Models | | 0 |
| Irreducible Curriculum for Language Model Pretraining | | 0 |
| Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs | | 0 |
| KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations | | 0 |
| KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | | 0 |
| KurTail: Kurtosis-based LLM Quantization | | 0 |

Benchmark Results

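| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | go ahead, make my data | Final_score | 61.72 | | Unverified |
| 2 | #GreedyCow | Final_score | 61.63 | | Unverified |
| 3 | Don't Ask Us y | Final_score | 61.4 | | Unverified |
| 4 | Data_and_Confused | Final_score | 60.96 | | Unverified |
| 5 | raaka | Final_score | 60.91 | | Unverified |
| 6 | Waffles | Final_score | 60.91 | | Unverified |
| 7 | Team Procrustination | Final_score | 60.64 | | Unverified |
| 8 | Axiom Consulting Partners | Final_score | 60.63 | | Unverified |
| 9 | Lets_Be_Fair | Final_score | 60.23 | | Unverified |
| 10 | gooners | Final_score | 60.22 | | Unverified |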

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Orange-mini | 0-shot MRR | 99.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | HybridBeam+ | SI-SDRi | 13.3 | | Unverified |