SOTAVerified

Memorization

Papers

Showing 1–25 of 1088 papers

Title | Status | Hype
MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery | Code | 7
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | Code | 6
LIMO: Less is More for Reasoning | Code | 5
MUSE: Machine Unlearning Six-Way Evaluation for Language Models | Code | 4
Amortized Planning with Large-Scale Transformers: A Case Study on Chess | Code | 4
Parameter Efficient Instruction Tuning: An Empirical Study | Code | 4
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling | Code | 4
R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning | Code | 4
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets | Code | 4
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | Code | 4
AgentTuning: Enabling Generalized Agent Abilities for LLMs | Code | 3
MathArena: Evaluating LLMs on Uncontaminated Math Competitions | Code | 3
From Matching to Generation: A Survey on Generative Information Retrieval | Code | 3
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs | Code | 2
LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models | Code | 2
PaLM: Scaling Language Modeling with Pathways | Code | 2
HMT: Hierarchical Memory Transformer for Long Context Language Processing | Code | 2
A Decade's Battle on Dataset Bias: Are We There Yet? | Code | 2
LawBench: Benchmarking Legal Knowledge of Large Language Models | Code | 2
Drive Like a Human: Rethinking Autonomous Driving with Large Language Models | Code | 2
Detecting, Explaining, and Mitigating Memorization in Diffusion Models | Code | 2
HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization | Code | 2
DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation | Code | 2
Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data | Code | 2
Causal Reasoning and Large Language Models: Opening a New Frontier for Causality | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | PaLM-540B (few-shot, k=5) | Accuracy | 95.4 | - | Unverified
2 | Gopher-280B (few-shot, k=5) | Accuracy | 80 | - | Unverified
3 | PaLM-62B (few-shot, k=5) | Accuracy | 77.7 | - | Unverified