SOTAVerified

MMLU

Papers

Showing 151-200 of 340 papers

Title | Status | Hype
SHA256 at SemEval-2025 Task 4: Selective Amnesia -- Constrained Unlearning for Large Language Models via Knowledge Isolation | Code | 0
Transferable text data distillation by trajectory matching | | 0
Probing then Editing Response Personality of Large Language Models | Code | 0
Domain-Adaptive Continued Pre-Training of Small Language Models | | 0
Large Language Models Could Be Rote Learners | | 0
GAAPO: Genetic Algorithmic Applied to Prompt Optimization | | 0
Sample, Don't Search: Rethinking Test-Time Alignment for Language Models | | 0
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | | 0
Order Independence With Finetuning | | 0
Effective Skill Unlearning through Intervention and Abstention | Code | 0
Efficient Model Development through Fine-tuning Transfer | | 0
ChatBench: From Static Benchmarks to Human-AI Evaluation | Code | 0
Bias Evaluation and Mitigation in Retrieval-Augmented Medical Question-Answering Systems | | 0
SuperBPE: Space Travel for Language Models | | 0
LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama | | 0
Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach | | 0
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | | 0
Effectiveness of Zero-shot-CoT in Japanese Prompts | | 0
Leveraging Approximate Caching for Faster Retrieval-Augmented Generation | | 0
Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework | | 0
Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning | | 0
Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size | | 0
KurTail : Kurtosis-based LLM Quantization | | 0
When an LLM is apprehensive about its answers -- and when its uncertainty is justified | Code | 0
None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering | | 0
PolyPrompt: Automating Knowledge Extraction from Multilingual Language Models with Dynamic Prompt Generation | | 0
Voting or Consensus? Decision-Making in Multi-Agent Debate | Code | 0
Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? | | 0
Efficient Federated Search for Retrieval-Augmented Generation | | 0
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging | Code | 0
Detecting Benchmark Contamination Through Watermarking | | 0
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks | | 0
Distributional Scaling Laws for Emergent Capabilities | | 0
Evaluating Expert Contributions in a MoE LLM for Quiz-Based Tasks | | 0
Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs | | 0
Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests | | 0
Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models | | 0
Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective | Code | 0
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | | 0
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception | | 0
Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | | 0
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning | | 0
Leveraging Uncertainty Estimation for Efficient LLM Routing | | 0
ORI: O Routing Intelligence | | 0
Cost-Saving LLM Cascades with Early Abstention | | 0
Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models | | 0
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon | Code | 0
OpenGrok: Enhancing SNS Data Processing with Distilled Knowledge and Mask-like Mechanisms | Code | 0
Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark | | 0
RoToR: Towards More Reliable Responses for Order-Invariant Inputs | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | go ahead, make my data | Final_score | 61.72 | | Unverified
2 | #GreedyCow | Final_score | 61.63 | | Unverified
3 | Don't Ask Us y | Final_score | 61.4 | | Unverified
4 | Data_and_Confused | Final_score | 60.96 | | Unverified
5 | raaka | Final_score | 60.91 | | Unverified
6 | Waffles | Final_score | 60.91 | | Unverified
7 | Team Procrustination | Final_score | 60.64 | | Unverified
8 | Axiom Consulting Partners | Final_score | 60.63 | | Unverified
9 | Lets_Be_Fair | Final_score | 60.23 | | Unverified
10 | gooners | Final_score | 60.22 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Orange-mini | 0-shot MRR | 99.19 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | HybridBeam+ | SI-SDRi | 13.3 | | Unverified