SOTAVerified

MMLU

Papers

Showing 201-250 of 340 papers

| Title | Status | Hype |
| --- | --- | --- |
| FRAMES: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy | | 0 |
| Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training | | 0 |
| QLESS: A Quantized Approach for Data Valuation and Selection in Large Language Model Fine-Tuning | Code | 0 |
| Evaluation of Large Language Models via Coupled Token Generation | Code | 0 |
| Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? | | 0 |
| LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient | Code | 0 |
| DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance | Code | 0 |
| IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding | | 0 |
| HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | | 0 |
| Humanity's Last Exam | | 0 |
| On the Reasoning Capacity of AI Models and How to Quantify It | | 0 |
| Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs | | 0 |
| Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy | Code | 0 |
| DNA 1.0 Technical Report | | 0 |
| Inference-Time-Compute: More Faithful? A Research Note | | 0 |
| CHAIR -- Classifier of Hallucination as Improver | Code | 0 |
| Unraveling Indirect In-Context Learning Using Influence Functions | | 0 |
| Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs | | 0 |
| Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | | 0 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | | 0 |
| ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study | Code | 0 |
| MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design | | 0 |
| ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers | | 0 |
| Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | | 0 |
| Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models | | 0 |
| LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering | | 0 |
| Llama 3 Meets MoE: Efficient Upcycling | Code | 0 |
| Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation | | 0 |
| Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset | | 0 |
| Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models | Code | 0 |
| The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? | | 0 |
| Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents | | 0 |
| Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models | | 0 |
| Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference | | 0 |
| Predicting Emergent Capabilities by Finetuning | | 0 |
| Learning from "Silly" Questions Improves Large Language Models, But Only Slightly | | 0 |
| GenBFA: An Evolutionary Optimization Approach to Bit-Flip Attacks on LLMs | | 0 |
| Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models | | 0 |
| Reasoning Robustness of LLMs to Adversarial Typographical Errors | | 0 |
| Watson: A Cognitive Observability Framework for the Reasoning of LLM-Powered Agents | | 0 |
| TODO: Enhancing LLM Alignment with Ternary Preferences | Code | 0 |
| Project MPG: towards a generalized performance benchmark for LLM capabilities | | 0 |
| Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment | | 0 |
| Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning | | 0 |
| BenTo: Benchmark Task Reduction with In-Context Transferability | Code | 0 |
| MIND: Math Informed syNthetic Dialogues for Pretraining LLMs | | 0 |
| G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks | | 0 |
| Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning | Code | 0 |
| Towards Multilingual LLM Evaluation for European Languages | | 0 |
| Upcycling Large Language Models into Mixture of Experts | | 0 |

Benchmark Results

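| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | go ahead, make my data | Final_score | 61.72 | | Unverified |
| 2 | #GreedyCow | Final_score | 61.63 | | Unverified |
| 3 | Don't Ask Us y | Final_score | 61.4 | | Unverified |
| 4 | Data_and_Confused | Final_score | 60.96 | | Unverified |
| 5 | Waffles | Final_score | 60.91 | | Unverified |
| 6 | raaka | Final_score | 60.91 | | Unverified |
| 7 | Team Procrustination | Final_score | 60.64 | | Unverified |
| 8 | Axiom Consulting Partners | Final_score | 60.63 | | Unverified |
| 9 | Lets_Be_Fair | Final_score | 60.23 | | Unverified |
| 10 | gooners | Final_score | 60.22 | | Unverified |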

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Orange-mini | 0-shot MRR | 99.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | HybridBeam+ | SI-SDRi | 13.3 | | Unverified |