SOTAVerified

MMLU

Papers

Showing 51-100 of 340 papers

Title | Status | Hype
HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts | Code | 1
Training Step-Level Reasoning Verifiers with Formal Verification Tools | Code | 1
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1
HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | Code | 1
Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation | Code | 1
MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports | Code | 1
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark | Code | 1
Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression | Code | 1
Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark | Code | 1
TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages | Code | 1
LM2: Large Memory Models | Code | 1
MyGO Multiplex CoT: A Method for Self-Reflection in Large Language Models via Double Chain of Thought Thinking | Code | 1
Control LLM: Controlled Evolution for Intelligence Retention in LLM | Code | 1
Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments | Code | 1
Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models | Code | 1
LOGO -- Long cOntext aliGnment via efficient preference Optimization | Code | 1
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design | Code | 1
CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning | Code | 1
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment | Code | 1
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning | Code | 1
ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models | Code | 1
A deeper look at depth pruning of LLMs | Code | 1
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs | Code | 1
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale | Code | 1
Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging | Code | 1
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models | Code | 1
LiveMind: Low-latency Large Language Models with Simultaneous Inference | Code | 1
OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning | Code | 1
Instruction Tuning With Loss Over Instructions | Code | 1
LawInstruct: A Resource for Studying Language Model Adaptation to the Legal Domain | Code | 1
Unfamiliar Finetuning Examples Control How Language Models Hallucinate | Code | 1
To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering | Code | 1
Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers | Code | 1
Gemini: A Family of Highly Capable Multimodal Models | Code | 1
Efficient Online Data Mixing For Language Model Pre-Training | Code | 1
Prompt Optimization via Adversarial In-Context Learning | Code | 1
ArcMMLU: A Library and Information Science Benchmark for Large Language Models | Code | 1
ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization | Code | 1
An Open Source Data Contamination Report for Large Language Models | Code | 1
Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models | Code | 1
A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration | Code | 1
OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch | Code | 1
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Code | 1
Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In | Code | 1
The Art of SOCRATIC QUESTIONING: Recursive Thinking with Large Language Models | Code | 1
Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers | Code | 1
Towards Expert-Level Medical Question Answering with Large Language Models | Code | 1
From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning | Code | 1
Large Language Models Encode Clinical Knowledge | Code | 1
UL2: Unifying Language Learning Paradigms | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | go ahead, make my data | Final_score | 61.72 | | Unverified
2 | #GreedyCow | Final_score | 61.63 | | Unverified
3 | Don't Ask Us y | Final_score | 61.4 | | Unverified
4 | Data_and_Confused | Final_score | 60.96 | | Unverified
5 | Waffles | Final_score | 60.91 | | Unverified
6 | raaka | Final_score | 60.91 | | Unverified
7 | Team Procrustination | Final_score | 60.64 | | Unverified
8 | Axiom Consulting Partners | Final_score | 60.63 | | Unverified
9 | Lets_Be_Fair | Final_score | 60.23 | | Unverified
10 | gooners | Final_score | 60.22 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Orange-mini | 0-shot MRR | 99.19 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | HybridBeam+ | SI-SDRi | 13.3 | | Unverified