SOTAVerified

MMLU

Papers

Showing 151-200 of 340 papers

Title | Status | Hype
Domain-Adaptive Continued Pre-Training of Small Language Models | | 0
DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining | | 0
Dual Decomposition of Weights and Singular Value Low Rank Adaptation | | 0
CodingTeachLLM: Empowering LLM's Coding Ability via AST Prior Knowledge | | 0
Effectiveness of Zero-shot-CoT in Japanese Prompts | | 0
Efficient Data Selection at Scale via Influence Distillation | | 0
Efficient Federated Search for Retrieval-Augmented Generation | | 0
Efficiently Deploying LLMs with Controlled Risk | | 0
Efficient Model Development through Fine-tuning Transfer | | 0
Assessing the Impact of Prompting Methods on ChatGPT's Mathematical Capabilities | | 0
Eir: Thai Medical Large Language Models | | 0
Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2 | | 0
Enterprise Large Language Model Evaluation Benchmark | | 0
Bias Evaluation and Mitigation in Retrieval-Augmented Medical Question-Answering Systems | | 0
Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks | | 0
Evaluation of large language models using an Indian language LGBTI+ lexicon | | 0
Few-Shot Recalibration of Language Models | | 0
FRAMES: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy | | 0
GAAPO: Genetic Algorithmic Applied to Prompt Optimization | | 0
Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models | | 0
Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training | | 0
G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks | | 0
GEB-1.3B: Open Lightweight Large Language Model | | 0
GECKO: Generative Language Model for English, Code and Korean | | 0
GEM: Empowering LLM for both Embedding Generation and Language Understanding | | 0
A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets | | 0
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation | | 0
GRIN: GRadient-INformed MoE | | 0
HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | | 0
Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models | | 0
Humanity's Last Exam | | 0
Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents | | 0
Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths | | 0
Reasoning Robustness of LLMs to Adversarial Typographical Errors | | 0
Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training | | 0
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models | | 0
Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training | | 0
Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment | | 0
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? | | 0
Actor-Critic based Online Data Mixing For Language Model Pre-Training | | 0
Revisiting Uncertainty Estimation and Calibration of Large Language Models | | 0
Sample, Don't Search: Rethinking Test-Time Alignment for Language Models | | 0
AcademicGPT: Empowering Academic Research | | 0
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | | 0
SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | | 0
Selective Self-Rehearsal: A Fine-Tuning Approach to Improve Generalization in Large Language Models | | 0
Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers | | 0
SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models | | 0
Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst | | 0
SEM: Reinforcement Learning for Search-Efficient Large Language Models | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | go ahead, make my data | Final_score | 61.72 | | Unverified
2 | #GreedyCow | Final_score | 61.63 | | Unverified
3 | Don't Ask Us y | Final_score | 61.4 | | Unverified
4 | Data_and_Confused | Final_score | 60.96 | | Unverified
5 | Waffles | Final_score | 60.91 | | Unverified
6 | raaka | Final_score | 60.91 | | Unverified
7 | Team Procrustination | Final_score | 60.64 | | Unverified
8 | Axiom Consulting Partners | Final_score | 60.63 | | Unverified
9 | Lets_Be_Fair | Final_score | 60.23 | | Unverified
10 | gooners | Final_score | 60.22 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Orange-mini | 0-shot MRR | 99.19 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | HybridBeam+ | SI-SDRi | 13.3 | | Unverified