Mathematical Reasoning

Papers

Showing 751–800 of 805 papers

Title	Date	Tasks	Status
Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning	Oct 18, 2024	Math, Mathematical Reasoning	Unverified
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback	Jan 18, 2025	Math, Mathematical Reasoning	Unverified
Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs	May 19, 2025	Mathematical Reasoning, Reinforcement Learning (RL)	Unverified
Subtle Errors Matter: Preference Learning via Error-injected Self-editing	Oct 9, 2024	GSM8K, Math	Unverified
Supervised Optimism Correction: Be Confident When LLMs Are Sure	Apr 10, 2025	GSM8K, Math	Unverified
Sustainability of Collusion and Market Transparency in a Sequential Search Market: a Generalization	May 5, 2021	Mathematical Reasoning	Unverified
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models	Feb 20, 2024	Instruction Following, Logical Reasoning	Unverified
Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use	Apr 7, 2025	GSM8K, Math	Unverified
System-2 Mathematical Reasoning via Enriched Instruction Tuning	Dec 22, 2024	ERP, GSM8K	Unverified
Table as Thought: Exploring Structured Thoughts in LLM Reasoning	Jan 4, 2025	Mathematical Reasoning	Unverified
Taming Generative Diffusion Prior for Universal Blind Image Restoration	Aug 21, 2024	Image Restoration, Mathematical Reasoning	Unverified
Tangram: Benchmark for Evaluating Geometric Element Recognition in Large Multimodal Models	Aug 25, 2024	Mathematical Reasoning	Unverified
Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving	Feb 17, 2025	Math, Mathematical Problem-Solving	Unverified
TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving	Jun 12, 2025	Logical Reasoning, Mathematical Problem-Solving	Unverified
Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic	Jun 9, 2025	Mathematical Reasoning	Unverified
Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset	Jun 25, 2025	Mathematical Reasoning	Unverified
Text Generation Beyond Discrete Token Sampling	May 20, 2025	Code Generation, Mathematical Reasoning	Unverified
The Axiom-Based Atlas: A Structural Mapping of Theorems via Foundational Proof Vectors	Mar 31, 2025	Mathematical Reasoning	Unverified
The Karp Dataset	Jan 24, 2025	Benchmarking, Mathematical Reasoning	Unverified
The Lessons of Developing Process Reward Models in Mathematical Reasoning	Jan 13, 2025	Mathematical Reasoning	Unverified
Theorem Prover as a Judge for Synthetic Data Generation	Feb 18, 2025	Mathematical Proofs, Mathematical Reasoning	Unverified
Theoretical Analysis of an XGBoost Framework for Product Cannibalization	Dec 2, 2021	Mathematical Reasoning	Unverified
The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic	Jun 28, 2024	Language Modeling, Language Modelling	Unverified
The Role of General Intelligence in Mathematical Reasoning	Apr 27, 2021	Mathematical Reasoning	Unverified
The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs	May 23, 2025	Cross-Lingual Transfer, Math	Unverified
Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners	Feb 27, 2025	Mamba, Mathematical Reasoning	Unverified
Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains	May 22, 2025	Mathematical Reasoning, Reinforcement Learning (RL)	Unverified
TinyGSM: achieving >80% on GSM8k with small language models	Dec 14, 2023	Arithmetic Reasoning, GSM8K	Unverified
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning	Feb 5, 2025	Language Modeling, Language Modelling	Unverified
Token-Level Uncertainty Estimation for Large Language Model Reasoning	May 16, 2025	Language Modeling, Language Modelling	Unverified
Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models	Jul 12, 2024	GSM8K, Math	Unverified
Demystifying Chains, Trees, and Graphs of Thoughts	Jan 25, 2024	Mathematical Reasoning, Prompt Engineering	Unverified
Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving	Jun 20, 2025	Automated Theorem Proving, Diversity	Unverified
Towards Efficient and Effective Alignment of Large Language Models	Jun 11, 2025	Mathematical Reasoning, Meta-Learning	Unverified
SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding	Aug 21, 2024	Logical Reasoning, Mathematical Reasoning	Unverified
Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning	Oct 9, 2024	Mathematical Reasoning	Unverified
Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems	May 21, 2025	Benchmarking, Math	Unverified
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning	Feb 25, 2025	Math, Mathematical Reasoning	Unverified
Towards Tractable Mathematical Reasoning: Challenges, Strategies, and Opportunities for Solving Math Word Problems	Oct 29, 2021	Answer Generation, Math	Unverified
Towards Understanding Multi-Round Large Language Model Reasoning: Approximability, Learnability and Generalizability	Mar 5, 2025	Language Modeling, Language Modelling	Unverified
TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees	Oct 10, 2024	Mathematical Reasoning	Unverified
Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning	Dec 4, 2024	GSM8K, Language Modeling	Unverified
Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning	May 21, 2025	Language Modeling, Language Modelling	Unverified
Tropical Attention: Neural Algorithmic Reasoning for Combinatorial Algorithms	May 22, 2025	Adversarial Attack, Benchmarking	Unverified
UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models	Jan 23, 2025	Mathematical Reasoning	Unverified
Uncertainty-Aware Step-wise Verification with Generative Reward Models	Feb 16, 2025	Mathematical Reasoning, Uncertainty Quantification	Unverified
Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap	Jan 5, 2025	Math, Mathematical Reasoning	Unverified
Uni-LoRA: One Vector is All You Need	Jun 1, 2025	All, Mathematical Reasoning	Unverified
Universal Self-Consistency for Large Language Model Generation	Nov 29, 2023	Code Generation, Language Modeling	Unverified
Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning	May 19, 2025	2k, Mathematical Reasoning	Unverified

Datasets: AIME24, FrontierMath, Lila (IID), Lila (OOD), PGPS9K, AMC23, GeoQA, Math500, UniGeo, UniGeo (PRV)

Benchmark Results

AIME24

#	Model	Metric	Claimed	Verified	Status
1	Xolver	Acc	94.4	—	Unverified
2	DeepSeek-r1	Acc	79.8	—	Unverified
3	Openai-o1	Acc	74.4	—	Unverified
4	Openai-o1-mini	Acc	70	—	Unverified
5	Search-o1	Acc	56.7	—	Unverified
6	s1-32B	Acc	56.7	—	Unverified
7	Openai-o1-preview	Acc	44.6	—	Unverified
8	Qwen2.5-72B-Instruct	Acc	23.3	—	Unverified
9	Claude3.5-Sonnet	Acc	16	—	Unverified

FrontierMath

#	Model	Metric	Claimed	Verified	Status
1	o3	Accuracy	0.25	—	Unverified
2	Gemini 1.5 Pro (002)	Accuracy	0.02	—	Unverified
3	GPT-4o	Accuracy	0.01	—	Unverified
4	o1-mini	Accuracy	0.01	—	Unverified
5	o1-preview	Accuracy	0.01	—	Unverified
6	Claude 3.5 Sonnet	Accuracy	0.01	—	Unverified

Lila (IID)

#	Model	Metric	Claimed	Verified	Status
1	Codex (Few-Shot, 175B)	Accuracy	0.6	—	Unverified
2	Bhāskara-P (Fine-tuned, 2.7B)	Accuracy	0.48	—	Unverified
3	Neo-P (Fine-tuned, 2.7B)	Accuracy	0.39	—	Unverified
4	GPT-3 (Few-Shot, 175B)	Accuracy	0.38	—	Unverified
5	Bhāskara-A (Fine-tuned, 2.7B)	Accuracy	0.25	—	Unverified
6	Neo-A (Fine-tuned, 2.7B)	Accuracy	0.2	—	Unverified

Lila (OOD)

#	Model	Metric	Claimed	Verified	Status
1	Codex (Few-Shot, 175B)	Accuracy	0.59	—	Unverified
2	Bhāskara-P (Fine-tuned, 2.7B)	Accuracy	0.45	—	Unverified
3	GPT-3 (Few-Shot, 175B)	Accuracy	0.38	—	Unverified
4	Bhāskara-A (Fine-tuned, 2.7B)	Accuracy	0.27	—	Unverified
5	Neo-P (Fine-tuned, 2.7B)	Accuracy	0.24	—	Unverified
6	Neo-A (Fine-tuned, 2.7B)	Accuracy	0.18	—	Unverified

PGPS9K

#	Model	Metric	Claimed	Verified	Status
1	GOLD	Completion accuracy	65.8	—	Unverified
2	PGPSNet	Completion accuracy	62.7	—	Unverified
3	GAPS	Completion accuracy	61.2	—	Unverified
4	Inter-GPS	Completion accuracy	59.8	—	Unverified
5	Geoformer	Completion accuracy	35.6	—	Unverified
6	NGS	Completion accuracy	34.1	—	Unverified

AMC23

#	Model	Metric	Claimed	Verified	Status
1	QWQ-32B-preview	Acc	82.5	—	Unverified
2	Math-Master	Acc	82	—	Unverified
3	Qwen2.5-Math-7B-instruct	Acc	62.5	—	Unverified

GeoQA

#	Model	Metric	Claimed	Verified	Status
1	GOLD	Accuracy (%)	75.2	—	Unverified
2	GAPS	Accuracy (%)	67.8	—	Unverified

Math500

#	Model	Metric	Claimed	Verified	Status
1	Search-o1	Acc	86.4	—	Unverified

UniGeo

#	Model	Metric	Claimed	Verified	Status
1	GOLD	Accuracy (%)	98.5	—	Unverified

UniGeo (PRV)

#	Model	Metric	Claimed	Verified	Status
1	GAPS	Accuracy (%)	97.5	—	Unverified