SOTAVerified

MMLU

Papers

Showing 101–150 of 340 papers

| Title | Status | Hype |
|---|---|---|
| Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate | Code | 0 |
| Training-Free Exponential Context Extension via Cascading KV Cache | Code | 0 |
| The Price of Format: Diversity Collapse in LLMs | Code | 0 |
| Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs | Code | 0 |
| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Code | 0 |
| TODO: Enhancing LLM Alignment with Ternary Preferences | Code | 0 |
| Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon | Code | 0 |
| Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy | Code | 0 |
| Evaluation of Large Language Models via Coupled Token Generation | Code | 0 |
| SHA256 at SemEval-2025 Task 4: Selective Amnesia -- Constrained Unlearning for Large Language Models via Knowledge Isolation | Code | 0 |
| ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation | Code | 0 |
| Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation | Code | 0 |
| CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions | Code | 0 |
| Empowering Cross-lingual Abilities of Instruction-tuned Large Language Models by Translation-following demonstrations | Code | 0 |
| RoToR: Towards More Reliable Responses for Order-Invariant Inputs | Code | 0 |
| EmPO: Emotion Grounding for Empathetic Response Generation through Preference Optimization | Code | 0 |
| Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations | Code | 0 |
| Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models | Code | 0 |
| Void in Language Models | Code | 0 |
| ChatBench: From Static Benchmarks to Human-AI Evaluation | Code | 0 |
| QLESS: A Quantized Approach for Data Valuation and Selection in Large Language Model Fine-Tuning | Code | 0 |
| CHAIR -- Classifier of Hallucination as Improver | Code | 0 |
| Effective Skill Unlearning through Intervention and Abstention | Code | 0 |
| ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling | Code | 0 |
| Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective | Code | 0 |
| Capability-Based Scaling Laws for LLM Red-Teaming | Code | 0 |
| DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | Code | 0 |
| Post-Hoc Reversal: Are We Selecting Models Prematurely? | Code | 0 |
| Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function | Code | 0 |
| Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models | Code | 0 |
| Probing then Editing Response Personality of Large Language Models | Code | 0 |
| Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs | Code | 0 |
| OpenGrok: Enhancing SNS Data Processing with Distilled Knowledge and Mask-like Mechanisms | Code | 0 |
| BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | Code | 0 |
| ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study | Code | 0 |
| Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning | Code | 0 |
| MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs | Code | 0 |
| Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models | Code | 0 |
| DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance | Code | 0 |
| Inconsistencies in Masked Language Models | Code | 0 |
| LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning | Code | 0 |
| LM-Cocktail: Resilient Tuning of Language Models via Model Merging | Code | 0 |
| Instruction Tuning with Human Curriculum | Code | 0 |
| BenTo: Benchmark Task Reduction with In-Context Transferability | Code | 0 |
| Input Conditioned Graph Generation for Language Agents | Code | 0 |
| Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Code | 0 |
| MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning | Code | 0 |
| Llama 3 Meets MoE: Efficient Upcycling | Code | 0 |
| LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient | Code | 0 |
| An Empirical Study of Mamba-based Language Models | Code | 0 |
Page 3 of 7

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | go ahead, make my data | Final_score | 61.72 | | Unverified |
| 2 | #GreedyCow | Final_score | 61.63 | | Unverified |
| 3 | Don't Ask Us y | Final_score | 61.4 | | Unverified |
| 4 | Data_and_Confused | Final_score | 60.96 | | Unverified |
| 5 | Waffles | Final_score | 60.91 | | Unverified |
| 6 | raaka | Final_score | 60.91 | | Unverified |
| 7 | Team Procrustination | Final_score | 60.64 | | Unverified |
| 8 | Axiom Consulting Partners | Final_score | 60.63 | | Unverified |
| 9 | Lets_Be_Fair | Final_score | 60.23 | | Unverified |
| 10 | gooners | Final_score | 60.22 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Orange-mini | 0-shot MRR | 99.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | HybridBeam+ | SI-SDRi | 13.3 | | Unverified |