GSM8K

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 326–350 of 439 papers

Title	Date	Tasks	Status
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement	Sep 18, 2024	GSM8KMath	—Unverified
CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks	Sep 13, 2024	ARCCode Generation	—Unverified
STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning	Sep 10, 2024	GSM8KMixture-of-Experts	—Unverified
Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation	Sep 5, 2024	GSM8K	—Unverified
Building Math Agents with Multi-Turn Iterative Preference Learning	Sep 4, 2024	GSM8KMath	—Unverified
Prompt Baking	Sep 4, 2024	ARCGSM8K	—Unverified
S^3c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners	Sep 3, 2024	GSM8KMath	—Unverified
Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic	Aug 29, 2024	GSM8KLanguage Modeling	—Unverified
Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems	Aug 29, 2024	GSM8KLanguage Modeling	—Unverified
SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models	Aug 28, 2024	Data AugmentationGSM8K	—Unverified
Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs	Aug 18, 2024	DiversityGPU	—Unverified
SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models	Aug 16, 2024	GSM8KMMLU	—Unverified
Cool-Fusion: Fuse Large Language Models without Training	Jul 29, 2024	Combinatorial OptimizationGSM8K	—Unverified
Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost	Jul 29, 2024	GSM8KPrompt Engineering	—Unverified
Reliable Reasoning Beyond Natural Language	Jul 16, 2024	GSM8KMathematical Reasoning	—Unverified
Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models	Jul 12, 2024	GSM8KMath	—Unverified
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist	Jul 11, 2024	GSM8KMath	—Unverified
Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On	Jul 11, 2024	GSM8KMath	—Unverified
When is the consistent prediction likely to be a correct prediction?	Jul 8, 2024	GSM8KPrediction	—Unverified
Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks	Jul 4, 2024	GSM8KStrategyQA	—Unverified
metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models	Jul 4, 2024	ARCGSM8K	CodeCode Available
AgentInstruct: Toward Generative Teaching with Agentic Flows	Jul 3, 2024	GSM8KMMLU	—Unverified
Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs	Jul 1, 2024	DiversityGSM8K	—Unverified
Advancing Process Verification for Large Language Models via Tree-Based Preference Learning	Jun 29, 2024	Binary ClassificationGSM8K	—Unverified
LiteSearch: Efficacious Tree Search for LLM	Jun 29, 2024	GSM8KMathematical Reasoning	—Unverified

Show:10 25 50

← PrevPage 14 of 18Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Xolver	Accuracy	98.1	—	Unverified
2	Orange-mini	0-shot MRR	98	—	Unverified