SOTAVerified

GSM8K

Papers

Showing 251-300 of 439 papers

| Title | Status | Hype |
| --- | --- | --- |
| Reliable Reasoning Beyond Natural Language | | 0 |
| Qwen2 Technical Report | Code | 13 |
| Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models | | 0 |
| Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | | 0 |
| Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On | | 0 |
| When is the consistent prediction likely to be a correct prediction? | | 0 |
| LoRA-GA: Low-Rank Adaptation with Gradient Approximation | Code | 3 |
| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Code | 0 |
| Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks | | 0 |
| DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning | Code | 1 |
| AgentInstruct: Toward Generative Teaching with Agentic Flows | | 0 |
| Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs | | 0 |
| Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning | Code | 1 |
| Advancing Process Verification for Large Language Models via Tree-Based Preference Learning | | 0 |
| LiteSearch: Efficacious Tree Search for LLM | | 0 |
| Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs | Code | 3 |
| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Code | 0 |
| PORT: Preference Optimization on Reasoning Traces | | 0 |
| Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Code | 0 |
| LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback | Code | 1 |
| Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning | | 0 |
| Can LLMs Reason in the Wild with Programs? | Code | 0 |
| ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools | Code | 14 |
| Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Code | 1 |
| DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling | Code | 1 |
| ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation | Code | 0 |
| Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B | Code | 5 |
| Uncertainty Aware Learning for Language Model Alignment | | 0 |
| Improve Mathematical Reasoning in Language Models by Automated Process Supervision | | 0 |
| Does your data spark joy? Performance gains from domain upsampling at the end of training | | 0 |
| Automatic Instruction Evolving for Large Language Models | Code | 3 |
| GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment | Code | 0 |
| SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths | | 0 |
| Arithmetic Reasoning with LLM: Prolog Generation & Permutation | | 0 |
| LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | Code | 2 |
| Multi-Reference Preference Optimization for Large Language Models | | 0 |
| MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time | | 0 |
| Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training | Code | 7 |
| ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | Code | 1 |
| Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast | Code | 1 |
| From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step | Code | 3 |
| Multiple-Choice Questions are Efficient and Robust LLM Evaluators | Code | 1 |
| MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark | Code | 2 |
| Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving | | 0 |
| Meaning-Typed Programming: Language Abstraction and Runtime for Model-Integrated Applications | | 0 |
| MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning | Code | 3 |
| MathDivide: Improved mathematical reasoning by large language models | | 0 |
| MAmmoTH2: Scaling Instructions from the Web | | 0 |
| Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning | Code | 2 |
| Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning | Code | 3 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Accuracy | 98.1 | | Unverified |
| 2 | Orange-mini | 0-shot MRR | 98 | | Unverified |