SOTAVerified

MMLU

Papers

Showing 101–150 of 340 papers

| Title | Status | Hype |
|---|---|---|
| Efficient Federated Search for Retrieval-Augmented Generation | | 0 |
| WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging | Code | 0 |
| Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks | | 0 |
| Distributional Scaling Laws for Emergent Capabilities | | 0 |
| Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks | | 0 |
| Detecting Benchmark Contamination Through Watermarking | | 0 |
| Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs | | 0 |
| Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models | | 0 |
| Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests | | 0 |
| Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective | Code | 0 |
| None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | | 0 |
| Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | | 0 |
| Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception | | 0 |
| TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages | Code | 1 |
| Leveraging Uncertainty Estimation for Efficient LLM Routing | | 0 |
| OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning | | 0 |
| ORI: O Routing Intelligence | | 0 |
| Cost-Saving LLM Cascades with Early Abstention | | 0 |
| Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models | | 0 |
| Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon | Code | 0 |
| OpenGrok: Enhancing SNS Data Processing with Distilled Knowledge and Mask-like Mechanisms | Code | 0 |
| RoToR: Towards More Reliable Responses for Order-Invariant Inputs | Code | 0 |
| Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark | | 0 |
| LM2: Large Memory Models | Code | 1 |
| FRAMES: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy | | 0 |
| Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training | | 0 |
| QLESS: A Quantized Approach for Data Valuation and Selection in Large Language Model Fine-Tuning | Code | 0 |
| Evaluation of Large Language Models via Coupled Token Generation | Code | 0 |
| Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? | | 0 |
| LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient | Code | 0 |
| DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance | Code | 0 |
| IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding | | 0 |
| HardML: A Benchmark For Evaluating Data Science And Machine Learning Knowledge and Reasoning in AI | | 0 |
| Humanity's Last Exam | | 0 |
| On the Reasoning Capacity of AI Models and How to Quantify It | | 0 |
| Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs | | 0 |
| MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Code | 1 |
| Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy | Code | 0 |
| Control LLM: Controlled Evolution for Intelligence Retention in LLM | Code | 1 |
| DNA 1.0 Technical Report | | 0 |
| Inference-Time-Compute: More Faithful? A Research Note | | 0 |
| CHAIR -- Classifier of Hallucination as Improver | Code | 0 |
| Unraveling Indirect In-Context Learning Using Influence Functions | | 0 |
| Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | | 0 |
| Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs | | 0 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | | 0 |
| MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design | | 0 |
| MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark | Code | 2 |
| ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study | Code | 0 |
| ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers | Code | 0 |
Page 3 of 7

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | go ahead, make my data | Final_score | 61.72 | | Unverified |
| 2 | #GreedyCow | Final_score | 61.63 | | Unverified |
| 3 | Don't Ask Us y | Final_score | 61.4 | | Unverified |
| 4 | Data_and_Confused | Final_score | 60.96 | | Unverified |
| 5 | raaka | Final_score | 60.91 | | Unverified |
| 6 | Waffles | Final_score | 60.91 | | Unverified |
| 7 | Team Procrustination | Final_score | 60.64 | | Unverified |
| 8 | Axiom Consulting Partners | Final_score | 60.63 | | Unverified |
| 9 | Lets_Be_Fair | Final_score | 60.23 | | Unverified |
| 10 | gooners | Final_score | 60.22 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Orange-mini | 0-shot MRR | 99.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | HybridBeam+ | SI-SDRi | 13.3 | | Unverified |