GSM8K

Papers

Showing 201–250 of 439 papers

Title	Date	Tasks	Status	Score
Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems	Sep 30, 2024	GSM8K, Math	Code Available	5
SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving	Oct 19, 2023	GSM8K, Math	Code Available	5
ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation	Jun 16, 2024	Continual Learning, GSM8K	Code Available	5
SMART: Self-learning Meta-strategy Agent for Reasoning Tasks	Oct 21, 2024	GSM8K, Self-Learning	Code Available	5
AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations	Nov 22, 2023	Common Sense Reasoning, GSM8K	Code Available	5
Text-to-LoRA: Instant Transformer Adaption	Jun 6, 2025	ARC, GSM8K	Code Available	5
metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models	Jul 4, 2024	ARC, GSM8K	Code Available	5
The Price of Format: Diversity Collapse in LLMs	May 25, 2025	Diversity, GSM8K	Code Available	5
TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation	Feb 19, 2025	Dataset Generation, GSM8K	Code Available	5
TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students	May 2, 2025	GSM8K, In-Context Learning	Code Available	5
Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting	Feb 5, 2025	GSM8K, Math	Code Available	5
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation	Jun 25, 2024	ARC, Benchmarking	Code Available	5
Iterative Reasoning Preference Optimization	Apr 30, 2024	ARC, GSM8K	Unverified	0
Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning	Apr 17, 2024	GSM8K, Navigate	Unverified	0
MALT: Improving Reasoning with Multi-Agent LLM Training	Dec 2, 2024	Common Sense Reasoning, GSM8K	Unverified	0
MAmmoTH2: Scaling Instructions from the Web	May 6, 2024	Chatbot, GSM8K	Unverified	0
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist	Jul 11, 2024	GSM8K, Math	Unverified	0
Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs	Jan 21, 2025	GSM8K, In-Context Learning	Unverified	0
Interpretable Math Word Problem Solution Generation Via Step-by-step Planning	Jun 1, 2023	GSM8K, Language Modeling	Unverified	0
MathAttack: Attacking Large Language Models Towards Math Solving Ability	Sep 4, 2023	Adversarial Attack, GSM8K	Unverified	0
Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models	Feb 18, 2025	Data Augmentation, GSM8K	Unverified	0
Instance-adaptive Zero-shot Chain-of-Thought Prompting	Sep 30, 2024	GSM8K, Math	Unverified	0
MathDivide: Improved mathematical reasoning by large language models	May 12, 2024	GSM8K, Logical Reasoning	Unverified	0
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling	Oct 18, 2024	Computational Efficiency, GSM8K	Unverified	0
MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task	Feb 17, 2025	Code Completion, GSM8K	Unverified	0
MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs	Feb 26, 2024	GSM8K, Math	Unverified	0
InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM Fusion	Jan 6, 2025	GSM8K, HumanEval	Unverified	0
Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping	May 13, 2025	Domain Generalization, GSM8K	Unverified	0
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification	Oct 5, 2024	GSM8K, Math	Unverified	0
Maximizing Confidence Alone Improves Reasoning	May 28, 2025	GSM8K, Math	Unverified	0
Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients	May 3, 2025	GSM8K, MMLU	Unverified	0
Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving	May 20, 2024	GSM8K, Math	Unverified	0
Improving Complex Reasoning with Dynamic Prompt Corruption: A soft prompt Optimization Approach	Mar 17, 2025	GSM8K, Math	Unverified	0
Improve Mathematical Reasoning in Language Models by Automated Process Supervision	Jun 5, 2024	GSM8K, Math	Unverified	0
MIND: Math Informed syNthetic Dialogues for Pretraining LLMs	Oct 15, 2024	GSM8K, Math	Unverified	0
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time	May 25, 2024	GSM8K, Math	Unverified	0
Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs	Jul 1, 2024	Diversity, GSM8K	Unverified	0
Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference	Nov 27, 2024	GSM8K, Language Modeling	Unverified	0
Model Unlearning via Sparse Autoencoder Subspace Guided Projections	May 30, 2025	Adversarial Robustness, feature selection	Unverified	0
Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization	Feb 14, 2025	GSM8K, Inference Optimization	Unverified	0
Guideline Forest: Experience-Induced Multi-Guideline Reasoning with Stepwise Aggregation	Jun 9, 2025	GSM8K, HumanEval	Unverified	0
Multi-Reference Preference Optimization for Large Language Models	May 26, 2024	GSM8K, TruthfulQA	Unverified	0
Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision	Feb 5, 2024	GSM8K, Math	Unverified	0
GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements	Feb 13, 2024	GSM8K, Math	Unverified	0
GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems	Jul 17, 2025	Diversity, GSM8K	Unverified	0
From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference	Oct 4, 2023	Benchmarking, GPU	Unverified	0
From Good to Great: Improving Math Reasoning with Tool-Augmented Interleaf Prompting	Dec 18, 2023	Diversity, GSM8K	Unverified	0
From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education	Feb 19, 2025	Diagnostic, GSM8K	Unverified	0
Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute	Jun 18, 2025	continuous-control, Continuous Control	Unverified	0
First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning	Nov 14, 2023	GSM8K, Math	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	Xolver	Accuracy	98.1	—	Unverified
2	Orange-mini	0-shot MRR	98	—	Unverified