SOTAVerified

MMLU

Papers

Showing 151-200 of 340 papers

Title | Status | Hype
Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs | - | 0
Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments | Code | 1
Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models | - | 0
Llama 3 Meets MoE: Efficient Upcycling | Code | 0
LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering | - | 0
HadaCore: Tensor Core Accelerated Hadamard Transform Kernel | Code | 3
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation | - | 0
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset | - | 0
The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? | - | 0
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models | Code | 0
Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents | - | 0
Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models | - | 0
Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference | - | 0
Predicting Emergent Capabilities by Finetuning | - | 0
Learning from "Silly" Questions Improves Large Language Models, But Only Slightly | - | 0
GenBFA: An Evolutionary Optimization Approach to Bit-Flip Attacks on LLMs | - | 0
Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models | - | 0
Reasoning Robustness of LLMs to Adversarial Typographical Errors | - | 0
Watson: A Cognitive Observability Framework for the Reasoning of LLM-Powered Agents | - | 0
TODO: Enhancing LLM Alignment with Ternary Preferences | Code | 0
Project MPG: towards a generalized performance benchmark for LLM capabilities | - | 0
Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models | Code | 1
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design | Code | 1
LOGO -- Long cOntext aliGnment via efficient preference Optimization | Code | 1
Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment | - | 0
Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning | - | 0
BenTo: Benchmark Task Reduction with In-Context Transferability | Code | 0
MIND: Math Informed syNthetic Dialogues for Pretraining LLMs | - | 0
G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks | - | 0
CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning | Code | 1
LoLCATs: On Low-Rank Linearizing of Large Language Models | Code | 3
Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning | Code | 0
Towards Multilingual LLM Evaluation for European Languages | - | 0
Upcycling Large Language Models into Mixture of Experts | - | 0
Large Language Model Compression with Neural Architecture Search | - | 0
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment | Code | 1
Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths | - | 0
Continuous Approximations for Improving Quantization Aware Training of LLMs | - | 0
LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity | Code | 0
CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions | Code | 0
BrainTransformers: SNN-LLM | - | 0
Efficiently Deploying LLMs with Controlled Risk | - | 0
Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs | - | 0
SSR: Alignment-Aware Modality Connector for Speech Language Models | - | 0
DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining | - | 0
Instance-adaptive Zero-shot Chain-of-Thought Prompting | - | 0
Uncovering Latent Chain of Thought Vectors in Language Models | - | 0
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination | - | 0
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning | Code | 1
GRIN: GRadient-INformed MoE | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | go ahead, make my data | Final_score | 61.72 | - | Unverified
2 | #GreedyCow | Final_score | 61.63 | - | Unverified
3 | Don't Ask Us y | Final_score | 61.4 | - | Unverified
4 | Data_and_Confused | Final_score | 60.96 | - | Unverified
5 | raaka | Final_score | 60.91 | - | Unverified
6 | Waffles | Final_score | 60.91 | - | Unverified
7 | Team Procrustination | Final_score | 60.64 | - | Unverified
8 | Axiom Consulting Partners | Final_score | 60.63 | - | Unverified
9 | Lets_Be_Fair | Final_score | 60.23 | - | Unverified
10 | gooners | Final_score | 60.22 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Orange-mini | 0-shot MRR | 99.19 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | HybridBeam+ | SI-SDRi | 13.3 | - | Unverified