SOTAVerified

Mathematical Reasoning

Papers

Showing 401–450 of 805 papers

| Title | Status | Hype |
| --- | --- | --- |
| STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing | | 0 |
| VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning | | 0 |
| DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models | | 0 |
| Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning | | 0 |
| GFlowNet Fine-tuning for Diverse Correct Solutions in Mathematical Reasoning Tasks | | 0 |
| Library Learning Doesn't: The Curious Case of the Single-Use "Library" | Code | 0 |
| ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning | | 0 |
| Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems | Code | 1 |
| Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks | | 0 |
| Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch | Code | 2 |
| SIKeD: Self-guided Iterative Knowledge Distillation for mathematical reasoning | Code | 0 |
| Markov Chain of Thought for Efficient Mathematical Reasoning | | 0 |
| Can Large Language Models Invent Algorithms to Improve Themselves? | | 0 |
| Keep Guessing? When Considering Inference Scaling, Mind the Baselines | | 0 |
| Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology | | 0 |
| Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning | | 0 |
| How Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs | | 0 |
| AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning | | 0 |
| Enhancing Mathematical Reasoning in LLMs by Stepwise Correction | | 0 |
| Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning | Code | 0 |
| MIND: Math Informed syNthetic Dialogues for Pretraining LLMs | | 0 |
| Augmenting In-Context-Learning in LLMs via Automatic Data Labeling and Refinement | | 0 |
| How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective | Code | 0 |
| CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning | Code | 1 |
| Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning | | 0 |
| Expanding Search Space with Diverse Prompting Agents: An Efficient Sampling Approach for LLM Mathematical Reasoning | | 0 |
| HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics | Code | 1 |
| A Systematic Survey on Large Language Models for Algorithm Design | | 0 |
| SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights | Code | 4 |
| TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees | | 0 |
| Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks | | 0 |
| Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | Code | 2 |
| MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code | Code | 2 |
| Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models | Code | 0 |
| VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers | | 0 |
| Herald: A Natural Language Annotated Lean 4 Dataset | | 0 |
| Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning | | 0 |
| PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness | | 0 |
| Subtle Errors Matter: Preference Learning via Error-injected Self-editing | | 0 |
| FG-PRM: Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning | | 0 |
| LeanAgent: Lifelong Learning for Formal Theorem Proving | Code | 2 |
| Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning | | 0 |
| Give me a hint: Can LLMs take a hint to solve math problems? | Code | 0 |
| MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs | | 0 |
| GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models | Code | 1 |
| Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark | Code | 0 |
| ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | | 0 |
| Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement | Code | 2 |
| TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions | Code | 0 |
| Table Question Answering for Low-resourced Indic Languages | Code | 0 |

Benchmark Results

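| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Xolver | Acc | 94.4 | | Unverified |
| 2 | DeepSeek-r1 | Acc | 79.8 | | Unverified |
| 3 | Openai-o1 | Acc | 74.4 | | Unverified |
| 4 | Openai-o1-mini | Acc | 70 | | Unverified |
| 5 | s1-32B | Acc | 56.7 | | Unverified |
| 6 | Search-o1 | Acc | 56.7 | | Unverified |
| 7 | Openai-o1-preview | Acc | 44.6 | | Unverified |
| 8 | Qwen2.5-72B-Instruct | Acc | 23.3 | | Unverified |
| 9 | Claude3.5-Sonnet | Acc | 16 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | o3 | Accuracy | 0.25 | | Unverified |
| 2 | Gemini 1.5 Pro (002) | Accuracy | 0.02 | | Unverified |
| 3 | o1-preview | Accuracy | 0.01 | | Unverified |
| 4 | GPT-4o | Accuracy | 0.01 | | Unverified |
| 5 | Claude 3.5 Sonnet | Accuracy | 0.01 | | Unverified |
| 6 | o1-mini | Accuracy | 0.01 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.6 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.48 | | Unverified |
| 3 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.39 | | Unverified |
| 4 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 5 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.25 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Codex (Few-Shot, 175B) | Accuracy | 0.59 | | Unverified |
| 2 | Bhāskara-P (Fine-tuned, 2.7B) | Accuracy | 0.45 | | Unverified |
| 3 | GPT-3 (Few-Shot, 175B) | Accuracy | 0.38 | | Unverified |
| 4 | Bhāskara-A (Fine-tuned, 2.7B) | Accuracy | 0.27 | | Unverified |
| 5 | Neo-P (Fine-tuned, 2.7B) | Accuracy | 0.24 | | Unverified |
| 6 | Neo-A (Fine-tuned, 2.7B) | Accuracy | 0.18 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Completion accuracy | 65.8 | | Unverified |
| 2 | PGPSNet | Completion accuracy | 62.7 | | Unverified |
| 3 | GAPS | Completion accuracy | 61.2 | | Unverified |
| 4 | Inter-GPS | Completion accuracy | 59.8 | | Unverified |
| 5 | Geoformer | Completion accuracy | 35.6 | | Unverified |
| 6 | NGS | Completion accuracy | 34.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | QWQ-32B-preview | Acc | 82.5 | | Unverified |
| 2 | Math-Master | Acc | 82 | | Unverified |
| 3 | Qwen2.5-Math-7B-instruct | Acc | 62.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 75.2 | | Unverified |
| 2 | GAPS | Accuracy (%) | 67.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Search-o1 | Acc | 86.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GOLD | Accuracy (%) | 98.5 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GAPS | Accuracy (%) | 97.5 | | Unverified |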